Karim Shoair committed
Commit · 2a85f06
Parent(s): 0832de7
First version of Scrapling full documentation

Files changed:
- docs/Core/using scrapling custom types.md +0 -21
- docs/Examples/selectorless_stackoverflow.py +0 -25
- docs/Extending Scrapling/writing storage system.md +0 -17
- docs/api-reference/adaptor.md +20 -0
- docs/api-reference/custom-types.md +21 -0
- docs/api-reference/fetchers.md +25 -0
- docs/benchmarks.md +44 -0
- docs/contributing.md +102 -0
- docs/development/automatch_storage_system.md +66 -0
- docs/development/scrapling_custom_types.md +21 -0
- docs/donate.md +27 -0
- docs/fetching/choosing.md +77 -0
- docs/fetching/dynamic.md +248 -0
- docs/fetching/static.md +300 -0
- docs/fetching/stealthy.md +218 -0
- docs/index.md +107 -2
- docs/overview.md +328 -0
- docs/parsing/automatch.md +220 -0
- docs/parsing/main_classes.md +539 -0
- docs/parsing/selection.md +512 -0
- docs/stylesheets/extra.css +3 -0
- docs/tutorials/migrating_from_beautifulsoup.md +98 -0
- docs/tutorials/replacing_ai.md +1 -0
- mkdocs.yml +142 -0
docs/Core/using scrapling custom types.md
DELETED
@@ -1,21 +0,0 @@
> You can take advantage of the custom-made types for Scrapling and use them outside the library if you want. It's better than copying their code after all :)

### All current types can be imported alone like below
```python
>>> from scrapling.core.custom_types import TextHandler, AttributesHandler

>>> somestring = TextHandler('{}')
>>> somestring.json()
'{}'
>>> somedict_1 = AttributesHandler({'a': 1})
>>> somedict_2 = AttributesHandler(a=1)
```

Note `TextHandler` is a sub-class of Python's `str`, so all normal operations/methods that work with Python strings will work.
If you want to check for the type in your code, it's better to depend on the Python built-in function `issubclass`.

The class `AttributesHandler` is a sub-class of `collections.abc.Mapping`, so it's immutable (read-only) and all operations are inherited from it. The data passed can be accessed later through the `_data` property, but careful: it's of type `types.MappingProxyType`, so it's immutable (read-only) as well (faster than `collections.abc.Mapping` by fractions of seconds).

So basically, to make it simple for you if you are new to Python: the same operations and methods from the Python standard `dict` type will all work with the class `AttributesHandler`, except the ones that try to modify the actual data.

If you want to modify the data inside `AttributesHandler`, you have to convert it to a dictionary first, like with the `dict` function, and modify it outside.
docs/Examples/selectorless_stackoverflow.py
DELETED
@@ -1,25 +0,0 @@
"""
I only made this example to show how Scrapling features can be used to scrape a website without writing any selector,
so this script doesn't depend on the website structure.
"""

import requests

from scrapling import Adaptor

response = requests.get('https://stackoverflow.com/questions/tagged/web-scraping?sort=MostVotes&filters=NoAcceptedAnswer&edited=true&pagesize=50&page=2')
page = Adaptor(response.text, url=response.url)
# First, we will extract the first question's title and its author based on the text content
first_question_title = page.find_by_text('Run Selenium Python Script on Remote Server')
first_question_author = page.find_by_text('Ryan')
# because this page changes a lot
if first_question_title and first_question_author:
    # If you want, you can extract the other questions' tags like below
    first_question = first_question_title.find_ancestor(
        lambda ancestor: ancestor.attrib.get('id') and 'question-summary' in ancestor.attrib.get('id')
    )
    rest_of_questions = first_question.find_similar()
    # Since there is nothing to rely on to extract the other titles/authors from these elements without CSS/XPath selectors, due to the website's nature,
    # we will get all the remaining titles/authors on the page using the first title and first author we got above as a starting point
    for i, (title, author) in enumerate(zip(first_question_title.find_similar(), first_question_author.find_similar()), start=1):
        print(i, title.text, author.text)
docs/Extending Scrapling/writing storage system.md
DELETED
@@ -1,17 +0,0 @@
Scrapling by default uses SQLite, but in case you want to write your own storage system to store element properties for the auto-matching, this tutorial has you covered.

You might want to use FireBase, for example, and share the database between multiple spiders on different machines; it's a great idea to use an online database like that because this way the spiders will share data with each other.

So first, to make your storage class work, it must do the big 3:

1. Inherit from the abstract class `scrapling.storage_adaptors.StorageSystemMixin` and accept a string argument, which will be the `url` argument, to maintain the library logic.
2. Use the decorator `functools.lru_cache` on top of the class itself to follow the Singleton design pattern as the other classes do.
3. Implement the methods `save` and `retrieve`, as you see from the type hints:
    - The method `save` returns nothing and will get two arguments from the library:
        * The first one is of type `lxml.html.HtmlElement`, which is the element itself. It must be converted to a dictionary using the function `scrapling.utils._StorageTools.element_to_dict` so we keep the same format, then saved to your database as you wish.
        * The second one is a string, which is the identifier used for retrieval. The combination of this identifier and the `url` argument from initialization must be unique for each row, or the auto-match will be messed up.
    - The method `retrieve` takes a string, which is the identifier; using it with the `url` passed on initialization, the element's dictionary is retrieved from the database and returned if it exists, otherwise it returns `None`.

> If the instructions weren't clear enough for you, you can check my implementation using SQLite3 in the [storage_adaptors](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/storage_adaptors.py) file

If your class satisfies this, the rest is easy. If you are planning to use the library in a threaded application, make sure that your class supports it. The default class is thread-safe.

There are some helper functions added to the abstract class if you want to use them. It's easier to see for yourself in the [code](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/storage_adaptors.py); it's heavily commented :)
docs/api-reference/adaptor.md
ADDED
@@ -0,0 +1,20 @@
# Adaptor Class

The `Adaptor` class is the core parsing engine in Scrapling that provides HTML parsing and element selection capabilities.

Here's the reference information for the `Adaptor` class, with all its parameters, attributes, and methods.

You can import the `Adaptor` class directly like below:

```python
from scrapling.parser import Adaptor
```

## ::: scrapling.parser.Adaptor
    handler: python
    :docstring:

## ::: scrapling.parser.Adaptors
    handler: python
    :docstring:
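For quick orientation, here's a minimal usage sketch (the HTML string and selector are made-up illustrations, not part of the API reference):

```python
from scrapling.parser import Adaptor

page = Adaptor('<html><body><h1>Title</h1></body></html>')
page.css_first('h1::text')  # -> 'Title'
```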
docs/api-reference/custom-types.md
ADDED
@@ -0,0 +1,21 @@
# Custom Types API Reference

Here's the reference information for all the custom type classes Scrapling implements, with all their parameters, attributes, and methods.

You can import all of them directly like below:

```python
from scrapling.core.custom_types import TextHandler, TextHandlers, AttributesHandler
```

## ::: scrapling.core.custom_types.TextHandler
    handler: python
    :docstring:

## ::: scrapling.core.custom_types.TextHandlers
    handler: python
    :docstring:

## ::: scrapling.core.custom_types.AttributesHandler
    handler: python
    :docstring:
docs/api-reference/fetchers.md
ADDED
@@ -0,0 +1,25 @@
# Fetchers Classes

Here's the reference information for all fetcher-type classes' parameters, attributes, and methods.

You can import all of them directly like below:

```python
from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, PlayWrightFetcher
```

## ::: scrapling.fetchers.Fetcher
    handler: python
    :docstring:

## ::: scrapling.fetchers.AsyncFetcher
    handler: python
    :docstring:

## ::: scrapling.fetchers.PlayWrightFetcher
    handler: python
    :docstring:

## ::: scrapling.fetchers.StealthyFetcher
    handler: python
    :docstring:
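As a quick orientation, all of these classes are used without instantiation; for example:

```python
from scrapling.fetchers import Fetcher

page = Fetcher.get('https://example.com')  # returns a Response object (an Adaptor with response details)
```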
docs/benchmarks.md
ADDED
@@ -0,0 +1,44 @@
Scrapling isn't just powerful; it's also blazing fast. Scrapling implements many best practices, design patterns, and numerous optimizations to save fractions of seconds, all while focusing exclusively on parsing HTML documents.

Here are benchmarks comparing Scrapling's parsing speed to popular Python libraries in two tests.

### Text Extraction Speed Test

This test consists of extracting the text content of 5000 nested div elements.

Here are the results comparing Scrapling to all the well-known parsing libraries:

| # | Library | Time (ms) | vs Scrapling |
|---|:-----------------:|:---------:|:------------:|
| 1 | Scrapling | 5.44 | 1.0x |
| 2 | Parsel/Scrapy | 5.53 | 1.017x |
| 3 | Raw Lxml | 6.76 | 1.243x |
| 4 | PyQuery | 21.96 | 4.037x |
| 5 | Selectolax | 67.12 | 12.338x |
| 6 | BS4 with Lxml | 1307.03 | 240.263x |
| 7 | MechanicalSoup | 1322.64 | 243.132x |
| 8 | BS4 with html5lib | 3373.75 | 620.175x |

As you see, Scrapling is on par with Parsel/Scrapy and slightly faster than raw lxml (the engine both are built on top of); these are the closest results to Scrapling. PyQuery is also built on top of lxml, but Scrapling is still four times faster.

### Extraction By Text Speed Test

Scrapling can find elements based on their text content and then find elements similar to them. The only other known library with these two features is AutoScraper.

So, we compared the two to see how fast Scrapling can be at these tasks compared to AutoScraper.

Here are the results:

| Library | Time (ms) | vs Scrapling |
|-------------|:---------:|:------------:|
| Scrapling | 2.51 | 1.0x |
| AutoScraper | 11.41 | 4.546x |

Scrapling can find elements with more methods and returns the entire element's `Adaptor` object, not only the text like AutoScraper. So, to make this test fair, both libraries extract an element by its text, find similar elements, and then extract the text content of all of them.
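That flow looks roughly like the sketch below in Scrapling, borrowed from the selector-less example elsewhere in these docs (the target text is just an illustration, not the benchmark's actual input; see `benchmarks.py` for the real methodology):

```python
from scrapling import Adaptor

page = Adaptor(html)  # `html` holds the page source
first = page.find_by_text('Run Selenium Python Script on Remote Server')
texts = [first.text] + [el.text for el in first.find_similar()]
```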
As you see, Scrapling is still 4.5 times faster at the same task.

If we made Scrapling extract the elements only, without stopping to extract each element's text, it would be twice as fast again, but as I said, we wanted to keep the comparison somewhat fair :smile:

> All benchmarks' results are an average of 100 runs. See our [benchmarks.py](https://github.com/D4Vinci/Scrapling/blob/main/benchmarks.py) for methodology and to run your comparisons.
docs/contributing.md
ADDED
@@ -0,0 +1,102 @@
Thank you for your interest in contributing to Scrapling!

Everybody is invited and welcome to contribute to Scrapling.

Smaller changes have a better chance of getting included in a timely manner. Adding unit tests for new features or test cases for bugs you've fixed helps us ensure that the Pull Request (PR) is acceptable.

There is a lot to do...

- If you are not a developer, you can help us improve the documentation.
- If you are a developer, most of the features I'm planning to add in the future have been moved to the [roadmap file](https://github.com/D4Vinci/Scrapling/blob/main/ROADMAP.md), so consider reading it.

## Running tests
Scrapling includes a comprehensive test suite that can be executed with pytest, but first, you need to install all the libraries and pytest plugins inside `tests/requirements.txt`. Then, running the tests will result in an output like this:
```bash
$ pytest tests
=============================== test session starts ===============================
platform darwin -- Python 3.12.8, pytest-8.3.3, pluggy-1.5.0 -- /Users/<redacted>/.venv/bin/python3.12
cachedir: .pytest_cache
rootdir: /Users/<redacted>/scrapling
configfile: pytest.ini
plugins: cov-5.0.0, asyncio-0.25.0, base-url-2.1.0, httpbin-2.1.0, playwright-0.5.2, anyio-4.6.2.post1, xdist-3.6.1, typeguard-4.3.0
asyncio: mode=Mode.AUTO, asyncio_default_fixture_loop_scope=function
collected 83 items

...<shortened>...

=============================== 83 passed in 157.52s (0:02:37) =====================
```
Tip: you can add `-n auto` to the command above to run the tests in parallel and increase speed.
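For example (this relies on the `pytest-xdist` plugin, which is among the plugins listed above):

```bash
pytest tests -n auto
```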
Bonus: You can also see the test coverage with the pytest plugin below
```bash
pytest --cov=scrapling tests/
```

## Installing the latest unstable version from the dev branch
```bash
pip3 install git+https://github.com/D4Vinci/Scrapling.git@dev
```

## Development
Setting the Scrapling logging level to `debug` makes it easier to know what's happening in the background.
```python
>>> import logging
>>> logging.getLogger("scrapling").setLevel(logging.DEBUG)
```
### Code Style

We use:

1. Type hints for better code clarity
2. Flake8, bandit, isort, and other hooks through `pre-commit`. <br/>Please install the hooks before committing with:
```bash
pip install pre-commit
pre-commit install
```
They will then run automatically on your code with each commit.
3. The conventional commit message format. We use the prefixes below in commit messages:

| Prefix | When to use it |
|-------------|--------------------------|
| `feat:` | New feature added |
| `fix:` | Bug fix |
| `docs:` | Documentation change/add |
| `test:` | Tests |
| `refactor:` | Code refactoring |
| `chore:` | Maintenance tasks |

Example:
```
feat: add auto-matching for similar elements

- Added find_similar() method
- Implemented pattern matching
- Added tests and documentation
```

### Push changes to the library

Then, the process is straightforward.

- Read [How to get faster PR reviews](https://github.com/kubernetes/community/blob/master/contributors/guide/pull-requests.md#best-practices-for-faster-reviews) by Kubernetes (but skip steps 0 and 1)
- Fork the Scrapling [Git repository](https://github.com/D4Vinci/Scrapling.git).
- Make your changes, and don't forget to create a separate virtual environment for this project.
- Ensure all tests are passing.
- Create a Pull Request against the [**dev**](https://github.com/D4Vinci/Scrapling/tree/dev) branch of Scrapling.

A bonus: if you have more than one version of Python installed, you can use tox to run the tests on each version with:
```bash
pip install tox
tox
```

> Note: All tests are automatically run on GitHub with each push on all supported Python versions using tox, so ensure all tests pass, or your PR will not be accepted.

## Building Documentation
```bash
pip install mkdocs-material
mkdocs serve  # Local preview
mkdocs build  # Build the static site
```
docs/development/automatch_storage_system.md
ADDED
@@ -0,0 +1,66 @@
Scrapling uses SQLite by default, but this tutorial covers writing your own storage system to store element properties for auto-matching.

You might want to use FireBase, for example, and share the database between multiple spiders on different machines. It's a great idea to use an online database like that because this way, the spiders will share data with each other.

So first, to make your storage class work, it must do the big 3:

1. Inherit from the abstract class `scrapling.core.storage_adaptors.StorageSystemMixin` and accept a string argument, which will be the `url` argument, to maintain the library logic.
2. Use the decorator `functools.lru_cache` on top of the class to follow the Singleton design pattern as the other classes do.
3. Implement the methods `save` and `retrieve`, as you see from the type hints:
    - The method `save` returns nothing and will get two arguments from the library:
        * The first one is of type `lxml.html.HtmlElement`, which is the element itself. It must be converted to a dictionary using the function `_StorageTools.element_to_dict` from `scrapling.core.utils` to keep the same format, then saved to your database as you wish.
        * The second one is a string, the identifier used for retrieval. The combination of this identifier and the `url` argument from initialization must be unique for each row, or the auto-match will be messed up.
    - The method `retrieve` takes a string, which is the identifier; using it with the `url` passed on initialization, the element's dictionary is retrieved from the database and returned if it exists; otherwise, it returns `None`.

> If the instructions weren't clear enough for you, you can check my implementation using SQLite3 in the [storage_adaptors](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/core/storage_adaptors.py) file

If your class meets these criteria, the rest is easy. If you plan to use the library in a threaded application, ensure your class supports it. The default class is thread-safe.

Some helper functions are added to the abstract class if you want to use them. It's easier to see for yourself in the [code](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/core/storage_adaptors.py); it's heavily commented :)


## Real-World Example: Redis Storage

Here's a more practical example generated by AI using Redis:

```python
import redis
import orjson
from functools import lru_cache
from typing import Optional

from scrapling.core.storage_adaptors import StorageSystemMixin
from scrapling.core.utils import _StorageTools


@lru_cache(None)
class RedisStorage(StorageSystemMixin):
    def __init__(self, host='localhost', port=6379, db=0, url=None):
        super().__init__(url)
        self.redis = redis.Redis(
            host=host,
            port=port,
            db=db,
            decode_responses=False
        )

    def save(self, element, identifier: str) -> None:
        # Convert the element to a dictionary
        element_dict = _StorageTools.element_to_dict(element)

        # Create the key
        key = f"scrapling:{self._get_base_url()}:{identifier}"

        # Store it as JSON
        self.redis.set(
            key,
            orjson.dumps(element_dict)
        )

    def retrieve(self, identifier: str) -> Optional[dict]:
        # Get the data
        key = f"scrapling:{self._get_base_url()}:{identifier}"
        data = self.redis.get(key)

        # Parse the JSON if it exists
        if data:
            return orjson.loads(data)
        return None
```
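To plug a class like this in, you'd pass it through the parser's storage options; here's a minimal sketch based on the `storage`/`storage_args` arguments described in the fetching docs (the connection details are illustrative):

```python
from scrapling import Adaptor

page = Adaptor(
    html,  # the page source you fetched
    url='https://example.com',
    auto_match=True,
    storage=RedisStorage,
    storage_args={'host': 'localhost', 'port': 6379},
)
```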
docs/development/scrapling_custom_types.md
ADDED
@@ -0,0 +1,21 @@
> You can take advantage of the custom-made types for Scrapling and use them outside the library if you want. It's better than copying their code, after all :)

### All current types can be imported alone like below
```python
>>> from scrapling.core.custom_types import TextHandler, AttributesHandler

>>> somestring = TextHandler('{}')
>>> somestring.json()
'{}'
>>> somedict_1 = AttributesHandler({'a': 1})
>>> somedict_2 = AttributesHandler(a=1)
```

Note that `TextHandler` is a subclass of Python's `str`, so all normal operations/methods that work with Python strings will work.
If you want to check for the type in your code, it's better to depend on Python's built-in function `issubclass`.
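For instance, since `TextHandler` subclasses `str`, a check like this holds:

```python
>>> issubclass(type(somestring), str)
True
```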
The class `AttributesHandler` is a subclass of `collections.abc.Mapping`, so it's immutable (read-only), and all operations are inherited from it. The data passed can be accessed later through the `_data` property, but be careful; it's of type `types.MappingProxyType`, so it's immutable (read-only) as well (faster than `collections.abc.Mapping` by fractions of seconds).

So, to make it simple for you if you are new to Python: the same operations and methods from the Python standard `dict` type will all work with the class `AttributesHandler`, except the ones that try to modify the actual data.

If you want to modify the data inside `AttributesHandler`, you have to convert it to a dictionary first, for example with the `dict` function, and then modify it outside.
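A short sketch of that conversion (the attribute name is made up for illustration):

```python
>>> attrs = AttributesHandler({'class': 'btn'})
>>> editable = dict(attrs)   # copy into a plain, mutable dict
>>> editable['class'] = 'btn primary'
```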
docs/donate.md
ADDED
@@ -0,0 +1,27 @@
I've been working on Scrapling and other public projects in my spare time and have invested considerable resources and effort to provide these projects for free to the community. By becoming a sponsor, you'd be directly funding my coffee reserves, helping me continuously update existing projects and potentially create new ones.

You can sponsor me directly through the [GitHub sponsors program](https://github.com/sponsors/D4Vinci) or [Buy Me A Coffee](https://buymeacoffee.com/d4vinci). If you are a **company** looking to **advertise** your business through Scrapling or another project, check out the available plans on my [GitHub Sponsors page](https://github.com/sponsors/D4Vinci).

Below is the list of our Gold tier sponsors.

Thank you, stay curious, and hack the planet! ❤️

---

## Top Sponsors
### Scrapeless

[Scrapeless Deep SerpApi](https://www.scrapeless.com/en/product/deep-serp-api?utm_source=website&utm_medium=ads&utm_campaign=scraping&utm_term=d4vinci): From $0.10 per 1,000 queries with a 1-2 second response time!

[Scrapeless](https://www.scrapeless.com/?utm_source=github&utm_medium=ads&utm_campaign=scraping&utm_term=D4Vinci)

Deep SerpApi is a dedicated search engine designed for large language models (LLMs) and AI agents. It aims to provide real-time, accurate, and unbiased information to help AI applications retrieve and process data efficiently.

- Covers 20+ Google SERP scenarios and mainstream search engines.
- Supports real-time data updates to ensure accurate, up-to-date information.
- It can integrate information from all available online channels and search engines.
- Deep SerpApi simplifies the process of integrating dynamic web information into AI solutions, ultimately achieving an all-in-one API for one-click search and extraction of web data.
- **Developer Support Program**: Integrate Scrapeless Deep SerpApi into your AI tools, applications, or projects. [We already support Dify, and will soon support frameworks such as Langchain, Langflow, FlowiseAI]. Then share your results on GitHub or social media, and you will get a 1-12 month free developer support opportunity, up to 500 free uses per month.
- 🚀 **Scraping API**: Effortless and highly customizable data extraction with a single API call, providing structured data from any website.
- ⚡ **Scraping Browser**: AI-powered and LLM-driven, it simulates human-like behavior with genuine fingerprints and headless browser support, ensuring seamless, block-free scraping.
- 🌐 **Proxies**: Use high-quality, rotating proxies to scrape top platforms like Amazon, Shopee, and more, with global coverage in 195+ countries.
docs/fetching/choosing.md
ADDED
@@ -0,0 +1,77 @@
## Introduction
Fetchers are classes that make requests or fetch pages for you in a single-line fashion, with many features, and then return a [Response](#response-object) object.

This feature was introduced because the only option before v0.2 was to fetch the page however you wanted, then pass it manually to the `Adaptor` class and start playing with it.

> Fetchers are not wrappers built on top of other libraries; they use these libraries as an engine to make requests/fetch pages easily for you while fully utilizing that engine and adding features that aren't included in those engines.

## Fetchers Overview

Scrapling provides three different fetcher classes, each designed for specific use cases.

The following table compares them and can be quickly used for guidance.

| Feature | Fetcher | PlayWrightFetcher | StealthyFetcher |
|--------------------|----------------|--------------------------------------------------------------------------------|--------------------------------------------------------------------------------------|
| Relative speed | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ |
| Stealth | ⭐ | ⭐⭐ | ⭐⭐⭐⭐ |
| Anti-Bot options | ⭐ | ⭐⭐ | ⭐⭐⭐⭐ |
| JavaScript loading | ❌ | ✅ | ✅ |
| Memory Usage | ⭐ | ⭐⭐⭐ | ⭐⭐⭐ |
| Best used for | Basic scraping | - Dynamically loaded websites <br/>- Small automation<br/>- Slight protections | - Dynamically loaded websites <br/>- Small automation <br/>- Complicated protections |
| Browser(s) | ❌ | Chromium and Google Chrome | Modified Firefox |
| Browser API used | ❌ | PlayWright | PlayWright |
| Setup Complexity | Simple | Simple | Simple |

In the following pages, we will talk about each one in detail.

## Parser configuration in all fetchers
All fetcher classes share the same import, as you will see in the upcoming pages
```python
>>> from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, PlayWrightFetcher
```
Then you use them right away without initialization, like this, and they will use the default parser settings:
```python
>>> page = StealthyFetcher.fetch('https://example.com')
```
If you want to configure the parser ([Adaptor class](../parsing/main_classes.md#adaptor)) that will be used on the response before it's returned to you, then do this first:
```python
>>> from scrapling.fetchers import Fetcher
>>> Fetcher.configure(auto_match=True, encoding="utf8", keep_comments=False, keep_cdata=False)  # and the rest
```
or
```python
>>> from scrapling.fetchers import Fetcher
>>> Fetcher.auto_match = True
>>> Fetcher.encoding = "utf8"
>>> Fetcher.keep_comments = False
>>> Fetcher.keep_cdata = False  # and the rest
```
Then, continue your code as usual.

The available configuration arguments are: `auto_match`, `huge_tree`, `keep_comments`, `keep_cdata`, `storage`, and `storage_args`, which are the same ones you give to the `Adaptor` class. You can display the current configuration anytime by running `<fetcher_class>.display_config()`.

> Note: The `auto_match` argument is disabled by default; you must enable it to use that feature.

### Set parser config per request
As you probably understood, the logic above for setting the parser config works globally for all requests/fetches done through that class, and it's intended for simplicity.

If your use case requires a different configuration for each request/fetch, you can pass a dictionary to an argument named `custom_config` on the request method (`fetch`/`get`/`post`/...), as sketched below.
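A minimal sketch of that per-request override (assuming `custom_config` accepts the same keys as the global configuration above):

```python
>>> page = Fetcher.get('https://example.com', custom_config={'keep_comments': True})
```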
## Response Object
The `Response` object is the same as the [Adaptor](../parsing/main_classes.md#adaptor) class, but with added details about the response, like the response headers, status, cookies, etc., as shown below:
```python
>>> from scrapling.fetchers import Fetcher
>>> page = Fetcher.get('https://example.com')

>>> page.status           # HTTP status code
>>> page.reason           # Status message
>>> page.cookies          # Response cookies as a dictionary
>>> page.headers          # Response headers
>>> page.request_headers  # Request headers
>>> page.history          # Response history of redirections, if any
>>> page.body             # Raw response body
>>> page.encoding         # Response encoding
```
All fetchers return the `Response` object.
docs/fetching/dynamic.md
ADDED
@@ -0,0 +1,248 @@
# Introduction

Here, we will discuss the `PlayWrightFetcher` class. This class provides flexible browser automation with multiple configuration options and some stealth capabilities. It uses [PlayWright](https://playwright.dev/python/docs/intro) as an engine for fetching websites.

As we will explain later, to automate the page, you need some knowledge of [PlayWright's Page API](https://playwright.dev/python/docs/api/class-page).

## Basic Usage
You have one primary way to import this Fetcher, which is the same for all fetchers.

```python
>>> from scrapling.fetchers import PlayWrightFetcher
```
Check out how to configure the parsing options [here](choosing.md#parser-configuration-in-all-fetchers).

Now we will go over most of the arguments one by one with examples; if you want to jump to a table of all arguments for quick reference, [click here](#full-list-of-arguments).

> Notes:
>
> 1. Every time you fetch a website with this fetcher, it waits by default for all JavaScript to fully load and execute, so you don't have to (it waits for the `domcontentloaded` state).
> 2. Of course, the async version of the `fetch` method is the `async_fetch` method.

This fetcher currently provides 4 main run options, but they can be mixed as you want. They are:

### 1. Vanilla Playwright
```python
PlayWrightFetcher.fetch('https://example.com')
```
Using it like that will open a Chromium browser and fetch the page. There are no tricks or extra features; it's just the plain PlayWright API.

### 2. Stealth Mode
```python
PlayWrightFetcher.fetch('https://example.com', stealth=True)
```
It's the same as the vanilla PlayWright option, but it provides a simple stealth mode suitable for websites with small-to-medium protection layer(s).

Some of the things this fetcher's stealth mode does include:

* Patching the CDP runtime fingerprint.
* Mimicking some real browser properties by injecting several JS files and using custom options.
* Using custom flags on launch to hide Playwright even more and make it faster.
* Generating real browser headers of the same browser type and user OS, then appending them to the request's headers.

### 3. Real Chrome
```python
PlayWrightFetcher.fetch('https://example.com', real_chrome=True)
```
If you have a Google Chrome browser installed, use this option. It's the same as the first option but will use the Google Chrome browser installed on your device instead of Chromium.

This will make your requests look more like they come from an actual human, so it's less detectable, and you can even use the `stealth=True` mode with it for better results, like below:
```python
PlayWrightFetcher.fetch('https://example.com', real_chrome=True, stealth=True)
```
If you don't have Google Chrome installed and want to use this option, you can use the command below in the terminal to install it for the library instead of installing it manually:
```commandline
playwright install chrome
```

### 4. CDP Connection
```python
PlayWrightFetcher.fetch('https://example.com', cdp_url='ws://localhost:9222')
```
Instead of launching a browser locally (Chromium/Google Chrome), you can connect to a remote browser through the [Chrome DevTools Protocol](https://chromedevtools.github.io/devtools-protocol/).

This fetcher takes it even a step further. You can use [NSTBrowser](https://app.nstbrowser.io/r/1vO5e5)'s [docker browserless](https://hub.docker.com/r/nstbrowser/browserless) option by passing the CDP URL and enabling the `nstbrowser_mode` option like below
```python
PlayWrightFetcher.fetch('https://example.com', cdp_url='ws://localhost:9222', nstbrowser_mode=True)
```
There's also a `nstbrowser_config` argument for the config you want to send with the requests to the NSTBrowser. If you leave it empty, Scrapling defaults to an optimized config for NSTBrowser's docker browserless.

## Full list of arguments
Scrapling provides many options with this fetcher, which work in all modes except the [NSTBrowser](https://app.nstbrowser.io/r/1vO5e5) mode. To make it as simple as possible, we will list the options here and give examples of using most of them.

| Argument | Description | Optional |
|:-------------------:|-------------|:--------:|
| url | Target url | ❌ |
| headless | Pass `True` to run the browser in headless/hidden (**default**) or `False` for headful/visible mode. | ✔️ |
| disable_resources | Drop requests of unnecessary resources for a speed boost. It depends, but it made requests ~25% faster in my tests for some websites.<br/>Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`. _This can help save your proxy usage, but be careful with this option as it makes some websites never finish loading._ | ✔️ |
| useragent | Pass a useragent string to be used. **Otherwise, the fetcher will generate and use a real useragent of the same browser.** | ✔️ |
| network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
| timeout | The timeout (milliseconds) used in all operations and waits through the page. The default is 30000. | ✔️ |
| wait | The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the `Response` object. | ✔️ |
| page_action | Added for automation. Pass a function that takes the `page` object and does the necessary automation, then returns `page` again. | ✔️ |
| wait_selector | Wait for a specific CSS selector to be in a specific state. | ✔️ |
| wait_selector_state | Scrapling will wait for the given state to be fulfilled for the selector given with `wait_selector`. _The default state is `attached`._ | ✔️ |
| google_search | Enabled by default, Scrapling will set the referer header as if this request came from a Google search for this website's domain name. | ✔️ |
| extra_headers | A dictionary of extra headers to add to the request. The referer set by the `google_search` argument takes priority over the referer set here if used together. | ✔️ |
| proxy | The proxy to be used with requests. It can be a string or a dictionary with the keys 'server', 'username', and 'password' only. | ✔️ |
| hide_canvas | Add random noise to canvas operations to prevent fingerprinting. | ✔️ |
| disable_webgl | Disables WebGL and WebGL 2.0 support entirely. | ✔️ |
| stealth | Enables stealth mode; you should always check the documentation to see what stealth mode currently does. | ✔️ |
| real_chrome | If you have a Chrome browser installed on your device, enable this, and the fetcher will launch an instance of your browser and use it. | ✔️ |
| locale | Set the locale for the browser if wanted. The default value is `en-US`. | ✔️ |
| cdp_url | Instead of launching a new browser instance, connect to this CDP URL to control real browsers/NSTBrowser through CDP. | ✔️ |
| nstbrowser_mode | Enables NSTBrowser mode; **it has to be used with the `cdp_url` argument, or it will get completely ignored.** | ✔️ |
| nstbrowser_config | The config you want to send with requests to the NSTBrowser. _Scrapling defaults to an optimized config for NSTBrowser's docker browserless if you leave this argument empty._ | ✔️ |

## Examples
It's easier to understand with examples, so let's look at some.

### Resource Control

```python
# Disable unnecessary resources
page = PlayWrightFetcher.fetch(
    'https://example.com',
    disable_resources=True  # Blocks fonts, images, media, etc...
)
```

### Network Control

```python
# Wait for network idle (consider the fetch finished when there are no network connections for at least 500 ms)
page = PlayWrightFetcher.fetch('https://example.com', network_idle=True)

# Custom timeout (in milliseconds)
page = PlayWrightFetcher.fetch('https://example.com', timeout=30000)  # 30 seconds

# Proxy support
page = PlayWrightFetcher.fetch(
    'https://example.com',
    proxy='http://username:password@host:port'  # Or it can be a dictionary with the keys 'server', 'username', and 'password' only
)
```
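A sketch of the dictionary form of `proxy` (using the keys named above; the values are placeholders):

```python
page = PlayWrightFetcher.fetch(
    'https://example.com',
    proxy={'server': 'http://host:port', 'username': 'username', 'password': 'password'}
)
```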
### Browser Automation
This is where your knowledge about [PlayWright's Page API](https://playwright.dev/python/docs/api/class-page) comes into play. The function you pass here takes the page object from Playwright's API, does what you want, and then returns it again for the current fetcher to continue working on it.

This function is executed right after waiting for network_idle (if enabled) and before waiting for the `wait_selector` argument, so it can be used for many things, not just automation. You can alter the page as you want.

In the example below, I used page [mouse events](https://playwright.dev/python/docs/api/class-mouse) to move the mouse wheel to scroll the page and then move the mouse.
```python
from playwright.sync_api import Page

def scroll_page(page: Page):
    page.mouse.wheel(10, 0)
    page.mouse.move(100, 400)
    page.mouse.up()
    return page

page = PlayWrightFetcher.fetch(
    'https://example.com',
    page_action=scroll_page
)
```
Of course, if you use the async fetch version, the function must also be async.
```python
from playwright.async_api import Page

async def scroll_page(page: Page):
    await page.mouse.wheel(10, 0)
    await page.mouse.move(100, 400)
    await page.mouse.up()
    return page

page = await PlayWrightFetcher.async_fetch(
    'https://example.com',
    page_action=scroll_page
)
```

### Wait Conditions

```python
# Wait for the selector
page = PlayWrightFetcher.fetch(
    'https://example.com',
    wait_selector='h1',
    wait_selector_state='visible'
)
```
This is the last wait the fetcher will do before returning the response (if enabled). You pass a CSS selector to the `wait_selector` argument, and the fetcher will wait for the state you passed in the `wait_selector_state` argument to be fulfilled. If you didn't pass a state, the default would be `attached`, which means it will wait for the element to be present in the DOM.

After that, the fetcher will check again to see if all JS files are loaded and executed (the `domcontentloaded` state) and wait for them if not. If you have enabled `network_idle` with this, the fetcher will wait for `network_idle` to be fulfilled again, as explained above.

The states the fetcher can wait for can be either ([source](https://playwright.dev/python/docs/api/class-page#page-wait-for-selector)):

- `attached`: Wait for an element to be present in the DOM.
- `detached`: Wait for an element to not be present in the DOM.
- `visible`: Wait for an element to have a non-empty bounding box and no `visibility:hidden`. Note that an element without any content or with `display:none` has an empty bounding box and is not considered visible.
- `hidden`: Wait for an element to be either detached from the DOM, or have an empty bounding box or `visibility:hidden`. This is the opposite of the `visible` option.

### Some Stealth Features

```python
# Full stealth mode
page = PlayWrightFetcher.fetch(
    'https://example.com',
    stealth=True,
    hide_canvas=True,
    disable_webgl=True,
    google_search=True
)

# Custom user agent
page = PlayWrightFetcher.fetch(
    'https://example.com',
    useragent='Mozilla/5.0...'
)

# Set browser locale
page = PlayWrightFetcher.fetch(
    'https://example.com',
    locale='en-US'
)
```
Note that the `hide_canvas` argument doesn't disable canvas but hides it by adding random noise to canvas operations to prevent fingerprinting. Also, if you didn't set a useragent (preferred), the fetcher will generate a real useragent of the same browser and use it.

The `google_search` argument is enabled by default, making the request look like it came from Google. So, a request for `https://example.com` will set the referer to `https://www.google.com/search?q=example`. Also, if used together, it takes priority over the referer set by the `extra_headers` argument.

### General example
```python
from scrapling.fetchers import PlayWrightFetcher

def scrape_dynamic_content():
    # Use PlayWright for JavaScript content
    page = PlayWrightFetcher.fetch(
        'https://example.com/dynamic',
        network_idle=True,
        wait_selector='.content'
    )

    # Extract dynamic content
    content = page.css('.content')

    return {
        'title': content.css_first('h1::text'),
        'items': [
            item.text for item in content.css('.item')
        ]
    }
```

## When to Use

Use PlayWrightFetcher when you:

- Need browser automation
- Want multiple browser options
- Are using a real Chrome browser
- Need custom browser config
- Want flexible stealth options

If you want more stealth and control without much config, check out the [StealthyFetcher](stealthy.md).
docs/fetching/static.md
ADDED
@@ -0,0 +1,300 @@
# Introduction

The `Fetcher` class provides fast and lightweight HTTP requests with some stealth capabilities. This class uses [httpx](https://www.python-httpx.org/) as an engine for making requests. For advanced usage, you will need some knowledge about [httpx](https://www.python-httpx.org/), but it keeps getting simpler with user feedback and updates.

## Basic Usage
You have one primary way to import this Fetcher, which is the same for all fetchers.

```python
>>> from scrapling.fetchers import Fetcher
```
Check out how to configure the parsing options [here](choosing.md#parser-configuration-in-all-fetchers).

### Shared arguments
All methods for making requests here share some arguments, so let's discuss them first.

- **url**: The URL you want to request, of course :)
- **proxy**: As the name implies, the proxy for this request, used to route all traffic (HTTP and HTTPS). The format accepted here is `http://username:password@localhost:8030`.
- **stealthy_headers**: Generate and use real browser headers, then create a referer header as if this request came from a Google search page for this URL's domain. Enabled by default; all generated headers can be overwritten by you through the `headers` argument.
- **follow_redirects**: As the name implies, tells the fetcher to follow redirections. Enabled by default.
- **timeout**: The timeout to wait for each request to finish, in milliseconds. The default is 30000 ms (30 seconds).
- **retries**: The number of retries that [httpx](https://www.python-httpx.org/) will do for failed requests. The default number of retries is 3.

Other than this, you can pass any arguments that `httpx.<method_name>` takes, and that's why I said at the beginning that you need a bit of knowledge about [httpx](https://www.python-httpx.org/); the following examples will try to cover most cases.
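For instance, arguments that httpx's request methods accept should pass straight through (a sketch; the `cookies` argument here goes to httpx, while `timeout` and `retries` are the shared arguments described above):

```python
>>> page = Fetcher.get('https://httpbin.org/cookies', cookies={'session': 'abc'})
>>> page = Fetcher.get('https://example.com', timeout=60000, retries=5)
```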
| 24 |
+
|
| 25 |

### HTTP Methods
Examples are the best way to explain this.

> Note: The `OPTIONS` and `HEAD` methods are not supported.

#### GET
```python
>>> from scrapling.fetchers import Fetcher
>>> # Basic GET
>>> page = Fetcher.get('https://example.com')
>>> page = Fetcher.get('https://httpbin.org/get', stealthy_headers=True, follow_redirects=True)
>>> page = Fetcher.get('https://httpbin.org/get', proxy='http://username:password@localhost:8030')
>>> # With parameters
>>> page = Fetcher.get('https://example.com/search', params={'q': 'query'})
>>>
>>> # With headers
>>> page = Fetcher.get('https://example.com', headers={'User-Agent': 'Custom/1.0'})
>>> # Basic HTTP authentication
>>> page = Fetcher.get("https://example.com", auth=("my_user", "password123"))
```
And for asynchronous requests, it's a small adjustment:
```python
>>> from scrapling.fetchers import AsyncFetcher
>>> # Basic GET
>>> page = await AsyncFetcher.get('https://example.com')
>>> page = await AsyncFetcher.get('https://httpbin.org/get', stealthy_headers=True, follow_redirects=True)
>>> page = await AsyncFetcher.get('https://httpbin.org/get', proxy='http://username:password@localhost:8030')
>>> # With parameters
>>> page = await AsyncFetcher.get('https://example.com/search', params={'q': 'query'})
>>>
>>> # With headers
>>> page = await AsyncFetcher.get('https://example.com', headers={'User-Agent': 'Custom/1.0'})
>>> # Basic HTTP authentication
>>> page = await AsyncFetcher.get("https://example.com", auth=("my_user", "password123"))
```
Needless to say, the `page` object in all cases is a [Response](choosing.md#response-object) object, which is an `Adaptor` as we said, so you can use it directly:
```python
>>> page.css('.something.something')

>>> page = Fetcher.get('https://api.github.com/events')
>>> page.json()
[{'id': '<redacted>',
  'type': 'PushEvent',
  'actor': {'id': '<redacted>',
   'login': '<redacted>',
   'display_login': '<redacted>',
   'gravatar_id': '',
   'url': 'https://api.github.com/users/<redacted>',
   'avatar_url': 'https://avatars.githubusercontent.com/u/<redacted>'},
  'repo': {'id': '<redacted>',
...
```
#### POST
```python
>>> from scrapling.fetchers import Fetcher
>>> # Basic POST
>>> page = Fetcher.post('https://httpbin.org/post', data={'key': 'value'})
>>> page = Fetcher.post('https://httpbin.org/post', data={'key': 'value'}, stealthy_headers=True, follow_redirects=True)
>>> page = Fetcher.post('https://httpbin.org/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030')
>>> # Another example of form-encoded data
>>> page = Fetcher.post('https://example.com/submit', data={'username': 'user', 'password': 'pass'})
>>> # JSON data
>>> page = Fetcher.post('https://example.com/api', json={'key': 'value'})
>>> # Uploading a file
>>> r = Fetcher.post("https://httpbin.org/post", files={'upload-file': open('something.xlsx', 'rb')})
```
And for asynchronous requests, it's a small adjustment:
```python
>>> from scrapling.fetchers import AsyncFetcher
>>> # Basic POST
>>> page = await AsyncFetcher.post('https://httpbin.org/post', data={'key': 'value'})
>>> page = await AsyncFetcher.post('https://httpbin.org/post', data={'key': 'value'}, stealthy_headers=True, follow_redirects=True)
>>> page = await AsyncFetcher.post('https://httpbin.org/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030')
>>> # Another example of form-encoded data
>>> page = await AsyncFetcher.post('https://example.com/submit', data={'username': 'user', 'password': 'pass'})
>>> # JSON data
>>> page = await AsyncFetcher.post('https://example.com/api', json={'key': 'value'})
>>> # Uploading a file
>>> r = await AsyncFetcher.post("https://httpbin.org/post", files={'upload-file': open('something.xlsx', 'rb')})
```
#### PUT
```python
>>> from scrapling.fetchers import Fetcher
>>> # Basic PUT
>>> page = Fetcher.put('https://example.com/update', data={'status': 'updated'})
>>> page = Fetcher.put('https://example.com/update', data={'status': 'updated'}, stealthy_headers=True, follow_redirects=True)
>>> page = Fetcher.put('https://example.com/update', data={'status': 'updated'}, proxy='http://username:password@localhost:8030')
>>> # Another example of form-encoded data
>>> page = Fetcher.put("https://httpbin.org/put", data={'key': ['value1', 'value2']})
```
And for asynchronous requests, it's a small adjustment:
```python
>>> from scrapling.fetchers import AsyncFetcher
>>> # Basic PUT
>>> page = await AsyncFetcher.put('https://example.com/update', data={'status': 'updated'})
>>> page = await AsyncFetcher.put('https://example.com/update', data={'status': 'updated'}, stealthy_headers=True, follow_redirects=True)
>>> page = await AsyncFetcher.put('https://example.com/update', data={'status': 'updated'}, proxy='http://username:password@localhost:8030')
>>> # Another example of form-encoded data
>>> page = await AsyncFetcher.put("https://httpbin.org/put", data={'key': ['value1', 'value2']})
```

#### DELETE
```python
>>> from scrapling.fetchers import Fetcher
>>> page = Fetcher.delete('https://example.com/resource/123')
>>> page = Fetcher.delete('https://example.com/resource/123', stealthy_headers=True, follow_redirects=True)
>>> page = Fetcher.delete('https://example.com/resource/123', proxy='http://username:password@localhost:8030')
```
And for asynchronous requests, it's a small adjustment:
```python
>>> from scrapling.fetchers import AsyncFetcher
>>> page = await AsyncFetcher.delete('https://example.com/resource/123')
>>> page = await AsyncFetcher.delete('https://example.com/resource/123', stealthy_headers=True, follow_redirects=True)
>>> page = await AsyncFetcher.delete('https://example.com/resource/123', proxy='http://username:password@localhost:8030')
```

## Examples
Some well-rounded examples to aid newcomers to Web Scraping.

### Basic HTTP Request

```python
from scrapling.fetchers import Fetcher

# Make a request
page = Fetcher.get('https://example.com')

# Check the status
if page.status == 200:
    # Extract title
    title = page.css_first('title::text')
    print(f"Page title: {title}")

    # Extract all links
    links = page.css('a::attr(href)')
    print(f"Found {len(links)} links")
```

### Product Scraping

```python
from scrapling.fetchers import Fetcher

def scrape_products():
    page = Fetcher.get('https://example.com/products')

    # Find all product elements
    products = page.css('.product')

    results = []
    for product in products:
        results.append({
            'title': product.css_first('.title::text'),
            'price': product.css_first('.price::text').re_first(r'\d+\.\d{2}'),
            'description': product.css_first('.description::text'),
            'in_stock': product.has_class('in-stock')
        })

    return results
```

### Pagination Handling

```python
from scrapling.fetchers import Fetcher

def scrape_all_pages():
    base_url = 'https://example.com/products?page={}'
    page_num = 1
    all_products = []

    while True:
        # Get current page
        page = Fetcher.get(base_url.format(page_num))

        # Find products
        products = page.css('.product')
        if not products:
            break

        # Process products
        for product in products:
            all_products.append({
                'name': product.css_first('.name::text'),
                'price': product.css_first('.price::text')
            })

        # Next page
        page_num += 1

    return all_products
```

### Form Submission

```python
from scrapling.fetchers import Fetcher

# Submit login form
response = Fetcher.post(
    'https://example.com/login',
    data={
        'username': 'user@example.com',
        'password': 'password123'
    }
)

# Check login success
if response.status == 200:
    # Extract user info
    user_name = response.css_first('.user-name::text')
    print(f"Logged in as: {user_name}")
```

### Table Extraction

```python
from scrapling.fetchers import Fetcher

def extract_table():
    page = Fetcher.get('https://example.com/data')

    # Find table
    table = page.css_first('table')

    # Extract headers
    headers = [
        th.text for th in table.css('thead th')
    ]

    # Extract rows
    rows = []
    for row in table.css('tbody tr'):
        cells = [td.text for td in row.css('td')]
        rows.append(dict(zip(headers, cells)))

    return rows
```

### Navigation Menu

```python
from scrapling.fetchers import Fetcher

def extract_menu():
    page = Fetcher.get('https://example.com')

    # Find navigation
    nav = page.css_first('nav')

    menu = {}
    for item in nav.css('li'):
        link = item.css_first('a')
        if link:
            menu[link.text] = {
                'url': link.attrib['href'],
                'has_submenu': bool(item.css('.submenu'))
            }

    return menu
```

## When to Use

Use `Fetcher` when:

- You need fast HTTP requests.
- You want minimal overhead.
- You don't need JavaScript execution.
- You want simple configuration.
- You need basic stealth features.

Use other fetchers when:

- You need browser automation.
- You need advanced anti-bot/stealth capabilities.
- You need JavaScript support.
docs/fetching/stealthy.md
ADDED

@@ -0,0 +1,218 @@
# Introduction

Here, we will discuss the `StealthyFetcher` class. This class is similar to [PlayWrightFetcher](dynamic.md#introduction) in many ways, like browser automation and using [PlayWright](https://playwright.dev/python/docs/intro) as an engine for fetching websites. The main difference is that this class provides advanced anti-bot bypass capabilities through a modified Firefox browser called [Camoufox](https://github.com/daijro/camoufox), from which most of the stealth comes.

As with [PlayWrightFetcher](dynamic.md#introduction), you will need some knowledge of [PlayWright's Page API](https://playwright.dev/python/docs/api/class-page) to automate the page, as we will explain later.

## Basic Usage
You have one primary way to import this Fetcher, which is the same for all fetchers.

```python
>>> from scrapling.fetchers import StealthyFetcher
```
Check out how to configure the parsing options [here](choosing.md#parser-configuration-in-all-fetchers).

> Notes:
>
> 1. Every time you fetch a website with this fetcher, it waits by default for all JavaScript to fully load and execute, so you don't have to (it waits for the `domcontentloaded` state).
> 2. Of course, the async version of the `fetch` method is the `async_fetch` method.
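
Here's a minimal sketch of both the sync and async forms with default settings (the URL is a placeholder):

```python
>>> from scrapling.fetchers import StealthyFetcher
>>> page = StealthyFetcher.fetch('https://example.com')  # Synchronous version
>>> page = await StealthyFetcher.async_fetch('https://example.com')  # Asynchronous version
```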

## Full list of arguments
Before jumping to the [examples](#examples), here's the full list of arguments:

|       Argument       | Description | Optional |
|:--------------------:|-------------|:--------:|
|         url          | Target url | ❌ |
|       headless       | Pass `True` to run the browser in headless/hidden mode (**default**), `virtual` to run it in virtual screen mode, or `False` for headful/visible mode. The `virtual` mode requires having `xvfb` installed. | ✔️ |
|     block_images     | Prevent the loading of images through Firefox preferences. _This can help save your proxy usage, but be careful with this option as it makes some websites never finish loading._ | ✔️ |
|  disable_resources   | Drop requests of unnecessary resources for a speed boost. It depends, but it made requests ~25% faster in my tests for some websites.<br/>Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`. _This can help save your proxy usage, but be careful with this option as it makes some websites never finish loading._ | ✔️ |
|    google_search     | Enabled by default, Scrapling will set the referer header as if this request came from a Google search for this website's domain name. | ✔️ |
|    extra_headers     | A dictionary of extra headers to add to the request. _The referer set by the `google_search` argument takes priority over the referer set here if used together._ | ✔️ |
|     block_webrtc     | Blocks WebRTC entirely. | ✔️ |
|     page_action      | Added for automation. A function that takes the `page` object and does the automation you need, then returns `page` again. | ✔️ |
|        addons        | List of Firefox addons to use. **Must be paths to extracted addons.** | ✔️ |
|       humanize       | Humanize the cursor movement. Takes either `True` or the MAX duration of the cursor movement in seconds. The cursor typically takes up to 1.5 seconds to move across the window. | ✔️ |
|     allow_webgl      | Enabled by default. Disabling WebGL is not recommended, as many WAFs now check if WebGL is enabled. | ✔️ |
|        geoip         | Recommended to use with proxies; automatically uses the IP's longitude, latitude, timezone, country, and locale, then spoofs the WebRTC IP address. It will also calculate and spoof the browser's language based on the distribution of language speakers in the target region. | ✔️ |
|     os_randomize     | If enabled, Scrapling will randomize the OS fingerprints used. The default is matching the fingerprints with the current OS. | ✔️ |
|     disable_ads      | Disabled by default; this installs the `uBlock Origin` addon on the browser if enabled. | ✔️ |
|     network_idle     | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
|       timeout        | The timeout used in all operations and waits through the page. It's in milliseconds, and the default is 30000. | ✔️ |
|         wait         | The time (in milliseconds) the fetcher will wait after everything finishes before closing the page and returning the `Response` object. | ✔️ |
|    wait_selector     | Wait for a specific CSS selector to be in a specific state. | ✔️ |
| wait_selector_state  | Scrapling will wait for the given state to be fulfilled for the selector given with `wait_selector`. _The default state is `attached`._ | ✔️ |
|        proxy         | The proxy to be used with requests. It can be a string or a dictionary with the keys 'server', 'username', and 'password' only. | ✔️ |
| additional_arguments | Arguments passed to Camoufox as additional settings that take higher priority than Scrapling's. | ✔️ |

## Examples
It's easier to understand with examples, so now we will go over most of the arguments individually with examples.

### Browser Modes

```python
# Headless/hidden mode (default)
page = StealthyFetcher.fetch('https://example.com', headless=True)

# Virtual display mode (requires having `xvfb` installed)
page = StealthyFetcher.fetch('https://example.com', headless='virtual')

# Visible browser mode
page = StealthyFetcher.fetch('https://example.com', headless=False)
```

### Resource Control

```python
# Block images
page = StealthyFetcher.fetch('https://example.com', block_images=True)

# Disable unnecessary resources
page = StealthyFetcher.fetch('https://example.com', disable_resources=True)  # Blocks fonts, images, media, etc.
```

### Additional stealth options

```python
page = StealthyFetcher.fetch(
    'https://example.com',
    block_webrtc=True,   # Block WebRTC
    allow_webgl=False,   # Disable WebGL
    humanize=True,       # Make the mouse move as a human would
    geoip=True,          # Use the IP's longitude, latitude, timezone, country, and locale, then spoof the WebRTC IP address...
    os_randomize=True,   # Randomize the OS fingerprints used. The default is matching the fingerprints with the current OS.
    disable_ads=True,    # Block ads with the uBlock Origin addon
    google_search=True
)

# Custom user agent
page = StealthyFetcher.fetch(
    'https://example.com',
    useragent='Mozilla/5.0...'
)

# Custom humanization duration
page = StealthyFetcher.fetch(
    'https://example.com',
    humanize=1.5  # Max 1.5 seconds for cursor movement
)
```

The `google_search` argument is enabled by default. It makes the request look as if it came from Google, so for a request to `https://example.com`, it will set the referer to `https://www.google.com/search?q=example`. Also, if used together, it takes priority over the referer set by the `extra_headers` argument.

### Network Control

```python
# Wait for network idle (consider the fetch finished when there are no network connections for at least 500 ms)
page = StealthyFetcher.fetch('https://example.com', network_idle=True)

# Custom timeout (in milliseconds)
page = StealthyFetcher.fetch('https://example.com', timeout=30000)  # 30 seconds

# Proxy support
page = StealthyFetcher.fetch(
    'https://example.com',
    proxy='http://username:password@host:port'  # Or a dictionary with the keys 'server', 'username', and 'password' only
)
```
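
As an illustration, the dictionary form of the `proxy` argument mentioned above might look like this sketch (the server address and credentials are placeholders):

```python
page = StealthyFetcher.fetch(
    'https://example.com',
    proxy={
        'server': 'http://localhost:8030',  # Placeholder proxy address
        'username': 'username',             # Placeholder credentials
        'password': 'password',
    }
)
```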

### Browser Automation
This is where your knowledge of [PlayWright's Page API](https://playwright.dev/python/docs/api/class-page) comes into play. The function you pass here takes the page object from Playwright's API, does what you want, and then returns it again for the current fetcher to continue working on it.

This function is executed right after waiting for `network_idle` (if enabled) and before waiting for the `wait_selector` argument, so it can be used for many things, not just automation. You can alter the page as you want.

In the example below, I used page [mouse events](https://playwright.dev/python/docs/api/class-mouse) to move the mouse wheel to scroll the page and then move the mouse.
```python
from playwright.sync_api import Page

def scroll_page(page: Page):
    page.mouse.wheel(10, 0)
    page.mouse.move(100, 400)
    page.mouse.up()
    return page

page = StealthyFetcher.fetch(
    'https://example.com',
    page_action=scroll_page
)
```
Of course, if you use the async fetch version, the function must also be async.
```python
from playwright.async_api import Page

async def scroll_page(page: Page):
    await page.mouse.wheel(10, 0)
    await page.mouse.move(100, 400)
    await page.mouse.up()
    return page

page = await StealthyFetcher.async_fetch(
    'https://example.com',
    page_action=scroll_page
)
```

### Wait Conditions
```python
# Wait for the selector
page = StealthyFetcher.fetch(
    'https://example.com',
    wait_selector='h1',
    wait_selector_state='visible'
)
```
This is the last wait the fetcher does before returning the response (if enabled). You pass a CSS selector to the `wait_selector` argument, and the fetcher will wait for the state you passed in the `wait_selector_state` argument to be fulfilled. If you didn't pass a state, the default is `attached`, which means it will wait for the element to be present in the DOM.

After that, the fetcher will check again to see if all JS files are loaded and executed (the `domcontentloaded` state) and wait for them if not. If you have enabled `network_idle` with this, the fetcher will wait for `network_idle` to be fulfilled again, as explained above.

The states the fetcher can wait for can be either ([source](https://playwright.dev/python/docs/api/class-page#page-wait-for-selector)) — see the sketch after this list:

- `attached`: wait for the element to be present in the DOM.
- `detached`: wait for the element to not be present in the DOM.
- `visible`: wait for the element to have a non-empty bounding box and no `visibility:hidden`. Note that an element without any content or with `display:none` has an empty bounding box and is not considered visible.
- `hidden`: wait for the element to be detached from the DOM, have an empty bounding box, or have `visibility:hidden`. This is the opposite of the `visible` option.

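As an illustration of the `detached` state, this sketch waits for a loading indicator to disappear before returning; the `.loading-spinner` selector is hypothetical:

```python
page = StealthyFetcher.fetch(
    'https://example.com',
    wait_selector='.loading-spinner',  # Hypothetical selector for a loading indicator
    wait_selector_state='detached'     # Continue once it is removed from the DOM
)
```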

### Firefox Addons

```python
# Custom Firefox addons
page = StealthyFetcher.fetch(
    'https://example.com',
    addons=['/path/to/addon1', '/path/to/addon2']
)
```
The paths here must be paths to extracted addons, which will be installed automatically upon browser launch.

### Real-world example (Amazon)
This is for educational purposes only; this example was generated by AI, which also shows how easy it is to work with Scrapling through AI.
```python
from scrapling.fetchers import StealthyFetcher


def scrape_amazon_product(url):
    # Use StealthyFetcher to bypass protection
    page = StealthyFetcher.fetch(url)

    # Extract product details
    return {
        'title': page.css_first('#productTitle::text').clean(),
        'price': page.css_first('.a-price .a-offscreen::text'),
        'rating': page.css_first('[data-feature-name="averageCustomerReviews"] .a-popover-trigger .a-color-base::text'),
        'reviews_count': page.css('#acrCustomerReviewText::text').re_first(r'[\d,]+'),
        'features': [
            li.clean() for li in page.css('#feature-bullets li span::text')
        ],
        'availability': page.css_first('#availability').get_all_text(strip=True),
        'images': [
            img.attrib['src'] for img in page.css('#altImages img')
        ]
    }
```

## When to Use

Use `StealthyFetcher` when:

- You are bypassing anti-bot protection.
- You need a reliable browser fingerprint.
- You need full JavaScript support.
- You want automatic stealth features.
- You need browser automation.
docs/index.md
CHANGED

@@ -1,2 +1,107 @@
# Scrapling

Scrapling is an undetectable, high-performance, intelligent Web scraping library for Python 3 that makes Web Scraping easy!

Scrapling isn't only about making undetectable requests or fetching pages under the radar!

It has its own parser that adapts to website changes, provides many element selection/querying options beyond traditional selectors, a powerful DOM traversal API, and many other features, all while significantly outperforming popular parsing alternatives.

Scrapling is built from the ground up by Web scraping experts, for beginners and experts alike. The goal is to provide powerful features while maintaining simplicity and minimal boilerplate code.

```python
>> from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, PlayWrightFetcher
>> StealthyFetcher.auto_match = True
# Fetch websites' source under the radar!
>> page = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True)
>> print(page.status)
200
>> products = page.css('.product', auto_save=True)  # Scrape data that survives website design changes!
>> # Later, if the website structure changes, pass `auto_match=True`
>> products = page.css('.product', auto_match=True)  # and Scrapling still finds them!
```
## Key Features
### Fetch websites as you prefer with async support
- **HTTP Requests**: Fast and stealthy HTTP requests with the `Fetcher` class.
- **Dynamic Loading & Automation**: Fetch dynamic websites with the `PlayWrightFetcher` class through your real browser, Scrapling's stealth mode, Playwright's Chromium browser, or [NSTbrowser](https://app.nstbrowser.io/r/1vO5e5)'s browserless!
- **Anti-bot Protections Bypass**: Easily bypass protections with the `StealthyFetcher` and `PlayWrightFetcher` classes.

### Easy Scraping
- **Smart Element Tracking**: Relocate elements after website changes using an intelligent similarity system and integrated storage.
- **Flexible Selection**: CSS selectors, XPath selectors, filter-based search, text search, regex search, and more.
- **Find Similar Elements**: Automatically locate elements similar to the element you found!
- **Smart Content Scraping**: Extract data from multiple websites without specific selectors using Scrapling's powerful features.

### High Performance
- **Lightning Fast**: Built from the ground up with performance in mind, outperforming most popular Python scraping libraries.
- **Memory Efficient**: Optimized data structures for a minimal memory footprint.
- **Fast JSON serialization**: 10x faster than the standard library.

### Developer Friendly
- **Powerful Navigation API**: Easy DOM traversal in all directions.
- **Rich Text Processing**: All strings have built-in regex, cleaning methods, and more. All elements' attributes are optimized dictionaries that use less memory than standard dictionaries, with added methods.
- **Auto Selector Generation**: Generate robust short and full CSS/XPath selectors for any element.
- **Familiar API**: Similar to Scrapy/BeautifulSoup, with the same CSS pseudo-elements used in Scrapy.
- **Type hints**: Complete type/doc-string coverage for future-proofing and the best autocompletion support.

## Star History
Scrapling's GitHub stars have grown steadily since its release (see the chart below).

<div id="chartContainer">
  <a href="https://github.com/D4Vinci/Scrapling">
    <img id="chartImage" alt="Star History Chart" src="https://api.star-history.com/svg?repos=D4Vinci/Scrapling&type=Date" height="400"/>
  </a>
</div>

<script>
const observer = new MutationObserver((mutations) => {
    mutations.forEach((mutation) => {
        if (mutation.attributeName === 'data-md-color-media') {
            const colorMedia = document.body.getAttribute('data-md-color-media');
            const isDarkScheme = document.body.getAttribute('data-md-color-scheme') === 'slate';
            const chartImg = document.querySelector('#chartImage');
            const baseUrl = 'https://api.star-history.com/svg?repos=D4Vinci/Scrapling&type=Date';

            if (colorMedia === '(prefers-color-scheme)' ? isDarkScheme : colorMedia.includes('dark')) {
                chartImg.src = `${baseUrl}&theme=dark`;
            } else {
                chartImg.src = baseUrl;
            }
        }
    });
});

observer.observe(document.body, {
    attributes: true,
    attributeFilter: ['data-md-color-media', 'data-md-color-scheme']
});
</script>

## Installation
Scrapling is a breeze to get started with!<br/>Starting from version 0.2.9, we require at least Python 3.9 to work.

Run this command to install it with Python's pip.
```bash
pip3 install scrapling
```
You are ready if you plan to use the parser only (the `Adaptor` class).

But if you are going to make requests or fetch pages with Scrapling, then run this command to install the browser dependencies needed to use the fetchers
```bash
scrapling install
```
If you have any installation issues, please open an [issue](https://github.com/D4Vinci/Scrapling/issues/new/choose).

## How the documentation is organized
Scrapling has a lot of documentation, so we try to follow a guideline called the [Diátaxis documentation framework](https://diataxis.fr/).

## Support

If you like Scrapling and want to support its development:

- ⭐ Star the [GitHub repository](https://github.com/D4Vinci/Scrapling)
- 💝 Consider [sponsoring the project or buying me a coffee](donate.md) :wink:
- 🐛 Report bugs and suggest features through [GitHub Issues](https://github.com/D4Vinci/Scrapling/issues)

## License

This project is licensed under the BSD-3 License. See the [LICENSE](https://github.com/D4Vinci/Scrapling/blob/main/LICENSE) file for details.
docs/overview.md
ADDED

@@ -0,0 +1,328 @@
We will start by quickly reviewing the parsing capabilities. Then, we will fetch websites with custom browsers, make requests, and parse the response.

Here's an HTML document, generated by ChatGPT, that we will be using as an example throughout this page:
```html
<html>
<head>
    <title>Complex Web Page</title>
    <style>
        .hidden { display: none; }
    </style>
</head>
<body>
    <header>
        <nav>
            <ul>
                <li> <a href="#home">Home</a> </li>
                <li> <a href="#about">About</a> </li>
                <li> <a href="#contact">Contact</a> </li>
            </ul>
        </nav>
    </header>
    <main>
        <section id="products" schema='{"jsonable": "data"}'>
            <h2>Products</h2>
            <div class="product-list">
                <article class="product" data-id="1">
                    <h3>Product 1</h3>
                    <p class="description">This is product 1</p>
                    <span class="price">$10.99</span>
                    <div class="hidden stock">In stock: 5</div>
                </article>

                <article class="product" data-id="2">
                    <h3>Product 2</h3>
                    <p class="description">This is product 2</p>
                    <span class="price">$20.99</span>
                    <div class="hidden stock">In stock: 3</div>
                </article>

                <article class="product" data-id="3">
                    <h3>Product 3</h3>
                    <p class="description">This is product 3</p>
                    <span class="price">$15.99</span>
                    <div class="hidden stock">Out of stock</div>
                </article>
            </div>
        </section>

        <section id="reviews">
            <h2>Customer Reviews</h2>
            <div class="review-list">
                <div class="review" data-rating="5">
                    <p class="review-text">Great product!</p>
                    <span class="reviewer">John Doe</span>
                </div>
                <div class="review" data-rating="4">
                    <p class="review-text">Good value for money.</p>
                    <span class="reviewer">Jane Smith</span>
                </div>
            </div>
        </section>
    </main>
    <script id="page-data" type="application/json">
    {
        "lastUpdated": "2024-09-22T10:30:00Z",
        "totalProducts": 3
    }
    </script>
</body>
</html>
```
Start by loading the raw HTML above like this
```python
from scrapling.parser import Adaptor

page = Adaptor(html_doc)  # `html_doc` is the HTML document string shown above
page  # <data='<html><head><title>Complex Web Page</tit...'>
```
Get all text content on the page recursively
```python
page.get_all_text(ignore_tags=('script', 'style'))
# 'Complex Web Page\nHome\nAbout\nContact\nProducts\nProduct 1\nThis is product 1\n$10.99\nIn stock: 5\nProduct 2\nThis is product 2\n$20.99\nIn stock: 3\nProduct 3\nThis is product 3\n$15.99\nOut of stock\nCustomer Reviews\nGreat product!\nJohn Doe\nGood value for money.\nJane Smith'
```

## Finding elements
If there's an element you want to find on the page, you will find it! Your creativity level is the only limitation!

Finding the first HTML `section` element
```python
section_element = page.find('section')
# <data='<section id="products" schema='{"jsonabl...' parent='<main><section id="products" schema='{"j...'>
```
Find all `section` elements
```python
section_elements = page.find_all('section')
# [<data='<section id="products" schema='{"jsonabl...' parent='<main><section id="products" schema='{"j...'>, <data='<section id="reviews"><h2>Customer Revie...' parent='<main><section id="products" schema='{"j...'>]
```
Find all `section` elements whose `id` attribute value is `products`
```python
section_elements = page.find_all('section', {'id': "products"})
# Same as
section_elements = page.find_all('section', id="products")
# [<data='<section id="products" schema='{"jsonabl...' parent='<main><section id="products" schema='{"j...'>]
```
Find all `section` elements whose `id` attribute value contains `product`
```python
section_elements = page.find_all('section', {'id*': "product"})
```
Find all `h3` elements whose text content matches the regex `Product \d`
```python
import re

page.find_all('h3', re.compile(r'Product \d'))
# [<data='<h3>Product 1</h3>' parent='<article class="product" data-id="1"><h3...'>, <data='<h3>Product 2</h3>' parent='<article class="product" data-id="2"><h3...'>, <data='<h3>Product 3</h3>' parent='<article class="product" data-id="3"><h3...'>]
```
Find all `h3` and `h2` elements whose text content matches the regex `Product` only
```python
page.find_all(['h3', 'h2'], re.compile(r'Product'))
# [<data='<h3>Product 1</h3>' parent='<article class="product" data-id="1"><h3...'>, <data='<h3>Product 2</h3>' parent='<article class="product" data-id="2"><h3...'>, <data='<h3>Product 3</h3>' parent='<article class="product" data-id="3"><h3...'>, <data='<h2>Products</h2>' parent='<section id="products" schema='{"jsonabl...'>]
```
Find all elements whose text content matches `Products` exactly (whitespace is not taken into consideration)
```python
page.find_by_text('Products', first_match=False)
# [<data='<h2>Products</h2>' parent='<section id="products" schema='{"jsonabl...'>]
```
Or find all elements whose text content matches the regex `Product \d`
```python
page.find_by_regex(r'Product \d', first_match=False)
# [<data='<h3>Product 1</h3>' parent='<article class="product" data-id="1"><h3...'>, <data='<h3>Product 2</h3>' parent='<article class="product" data-id="2"><h3...'>, <data='<h3>Product 3</h3>' parent='<article class="product" data-id="3"><h3...'>]
```
Find all elements that are similar to the element you want
```python
target_element = page.find_by_regex(r'Product \d', first_match=True)
# <data='<h3>Product 1</h3>' parent='<article class="product" data-id="1"><h3...'>
target_element.find_similar()
# [<data='<h3>Product 2</h3>' parent='<article class="product" data-id="2"><h3...'>, <data='<h3>Product 3</h3>' parent='<article class="product" data-id="3"><h3...'>]
```
Find the first element that matches a CSS selector
```python
page.css_first('.product-list [data-id="1"]')
# <data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'>
```
Find all elements that match a CSS selector
```python
page.css('.product-list article')
# [<data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'>, <data='<article class="product" data-id="2"><h3...' parent='<div class="product-list"> <article clas...'>, <data='<article class="product" data-id="3"><h3...' parent='<div class="product-list"> <article clas...'>]
```
Find the first element that matches an XPath selector
```python
page.xpath_first("//*[@id='products']/div/article")
# <data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'>
```
Find all elements that match an XPath selector
```python
page.xpath("//*[@id='products']/div/article")
# [<data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'>, <data='<article class="product" data-id="2"><h3...' parent='<div class="product-list"> <article clas...'>, <data='<article class="product" data-id="3"><h3...' parent='<div class="product-list"> <article clas...'>]
```

With this, we have just scratched the surface of these functions; more advanced options for these selection methods are shown later.
## Accessing elements' data
It's as simple as
```python
>>> section_element.tag
'section'
>>> print(section_element.attrib)
{'id': 'products', 'schema': '{"jsonable": "data"}'}
>>> section_element.attrib['schema'].json()  # If an attribute value can be converted to JSON, use `.json()` to convert it
{'jsonable': 'data'}
>>> section_element.text  # Direct text content
''
>>> section_element.get_all_text()  # All text content recursively
'Products\nProduct 1\nThis is product 1\n$10.99\nIn stock: 5\nProduct 2\nThis is product 2\n$20.99\nIn stock: 3\nProduct 3\nThis is product 3\n$15.99\nOut of stock'
>>> section_element.html_content  # The HTML content of the element
'<section id="products" schema=\'{"jsonable": "data"}\'><h2>Products</h2>\n <div class="product-list">\n <article class="product" data-id="1"><h3>Product 1</h3>\n <p class="description">This is product 1</p>\n <span class="price">$10.99</span>\n <div class="hidden stock">In stock: 5</div>\n </article><article class="product" data-id="2"><h3>Product 2</h3>\n <p class="description">This is product 2</p>\n <span class="price">$20.99</span>\n <div class="hidden stock">In stock: 3</div>\n </article><article class="product" data-id="3"><h3>Product 3</h3>\n <p class="description">This is product 3</p>\n <span class="price">$15.99</span>\n <div class="hidden stock">Out of stock</div>\n </article></div>\n </section>'
>>> print(section_element.prettify())  # The prettified version
'''
<section id="products" schema='{"jsonable": "data"}'><h2>Products</h2>
    <div class="product-list">
        <article class="product" data-id="1"><h3>Product 1</h3>
            <p class="description">This is product 1</p>
            <span class="price">$10.99</span>
            <div class="hidden stock">In stock: 5</div>
        </article><article class="product" data-id="2"><h3>Product 2</h3>
            <p class="description">This is product 2</p>
            <span class="price">$20.99</span>
            <div class="hidden stock">In stock: 3</div>
        </article><article class="product" data-id="3"><h3>Product 3</h3>
            <p class="description">This is product 3</p>
            <span class="price">$15.99</span>
            <div class="hidden stock">Out of stock</div>
        </article>
    </div>
</section>
'''
>>> section_element.path  # All the ancestors in the DOM tree of this element
[<data='<main><section id="products" schema='{"j...' parent='<body> <header><nav><ul><li> <a href="#h...'>,
 <data='<body> <header><nav><ul><li> <a href="#h...' parent='<html><head><title>Complex Web Page</tit...'>,
 <data='<html><head><title>Complex Web Page</tit...'>]
>>> section_element.generate_css_selector
'#products'
>>> section_element.generate_full_css_selector
'body > main > #products > #products'
>>> section_element.generate_xpath_selector
"//*[@id='products']"
>>> section_element.generate_full_xpath_selector
"//body/main/*[@id='products']"
```

## Navigation
Using the elements we found above

```python
>>> section_element.parent
<data='<main><section id="products" schema='{"j...' parent='<body> <header><nav><ul><li> <a href="#h...'>
>>> section_element.parent.tag
'main'
>>> section_element.parent.parent.tag
'body'
>>> section_element.children
[<data='<h2>Products</h2>' parent='<section id="products" schema='{"jsonabl...'>,
 <data='<div class="product-list"> <article clas...' parent='<section id="products" schema='{"jsonabl...'>]
>>> section_element.siblings
[<data='<section id="reviews"><h2>Customer Revie...' parent='<main><section id="products" schema='{"j...'>]
>>> section_element.next  # gets the next element; the same logic applies to `section_element.previous`
<data='<section id="reviews"><h2>Customer Revie...' parent='<main><section id="products" schema='{"j...'>
>>> section_element.children.css('h2::text')
['Products']
>>> page.css_first('[data-id="1"]').has_class('product')
True
```
If your case needs more than the element's parent, you can iterate over the whole ancestors' tree of any element like below
```python
for ancestor in section_element.iterancestors():
    print(ancestor.tag)  # or do something else with it...
```
You can search for a specific ancestor of an element that satisfies a condition; all you need to do is pass a function that takes an `Adaptor` object as an argument and returns `True` if the condition is satisfied or `False` otherwise, like below:
```python
>>> section_element.find_ancestor(lambda ancestor: ancestor.css('nav'))
<data='<body> <header><nav><ul><li> <a href="#h...' parent='<html><head><title>Complex Web Page</tit...'>
```

## Fetching websites
Instead of passing raw HTML to Scrapling, you can get a website's response directly through HTTP requests or by fetching it from browsers.

A fetcher is made for every use case.

### HTTP Requests
For simple HTTP requests, there's a `Fetcher` class that can be imported as below:
```python
from scrapling.fetchers import Fetcher
```
But that's a class, so you will need to create an instance of `Fetcher` first, like this:
```python
from scrapling.fetchers import Fetcher

fetcher = Fetcher()
page = fetcher.get('https://httpbin.org/get')
```
This is intended, and you will find it with all fetchers, because there are settings you can pass to the `Fetcher()` initialization, but more on this later.

If you are going to use the default settings anyway, you can do this instead for a cleaner approach:
```python
from scrapling.fetchers import Fetcher

page = Fetcher.get('https://httpbin.org/get')
```
With that out of the way, here's how to do all HTTP methods:
```python
>>> from scrapling.fetchers import Fetcher
>>> page = Fetcher.get('https://httpbin.org/get', stealthy_headers=True, follow_redirects=True)
>>> page = Fetcher.post('https://httpbin.org/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030')
>>> page = Fetcher.put('https://httpbin.org/put', data={'key': 'value'})
>>> page = Fetcher.delete('https://httpbin.org/delete')
```
For async requests, you just replace the import as below:
```python
>>> from scrapling.fetchers import AsyncFetcher
>>> page = await AsyncFetcher.get('https://httpbin.org/get', stealthy_headers=True, follow_redirects=True)
>>> page = await AsyncFetcher.post('https://httpbin.org/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030')
>>> page = await AsyncFetcher.put('https://httpbin.org/put', data={'key': 'value'})
>>> page = await AsyncFetcher.delete('https://httpbin.org/delete')
```

> Note: There's the `stealthy_headers` argument, which, when enabled, generates real browser headers and uses them for the request, including a referer header as if the request came from a Google search for this URL's domain. It's enabled by default.

This is just the tip of this fetcher's capabilities; check out the full page [here](fetching/static.md).

### Dynamic loading
We have you covered if you deal with dynamic websites, like most websites today!

The `PlayWrightFetcher` class provides many options to fetch/load websites' pages through browsers.
```python
>>> from scrapling.fetchers import PlayWrightFetcher
>>> page = PlayWrightFetcher.fetch('https://www.google.com/search?q=%22Scrapling%22', disable_resources=True)  # Vanilla Playwright option
>>> page.css_first("#search a::attr(href)")
'https://github.com/D4Vinci/Scrapling'
>>> # The async version of fetch
>>> page = await PlayWrightFetcher.async_fetch('https://www.google.com/search?q=%22Scrapling%22', disable_resources=True)
>>> page.css_first("#search a::attr(href)")
'https://github.com/D4Vinci/Scrapling'
```
It's named like that because it's built on top of [Playwright](https://playwright.dev/python/), and it currently provides 4 main run options that can be mixed as you want:

- Vanilla Playwright without any modifications other than the ones you chose.
- Stealthy Playwright with a custom stealth mode written explicitly for it. It's not a top-tier stealth mode, but it bypasses many online tests like [Sannysoft's](https://bot.sannysoft.com/). Check out the `StealthyFetcher` class below for a more advanced stealth mode.
- Real browsers, by passing the `real_chrome` argument or the CDP URL of your browser to be controlled by the fetcher; most of the options can be enabled with it.
- [NSTBrowser](https://app.nstbrowser.io/r/1vO5e5)'s [docker browserless](https://hub.docker.com/r/nstbrowser/browserless) option, by passing the CDP URL and enabling the `nstbrowser_mode` option.

> Note: All requests done by this fetcher wait by default for all JavaScript to be fully loaded and executed. In detail, it waits for the `load` and `domcontentloaded` load states to be reached; you can make it wait for the `networkidle` load state by passing `network_idle=True`, as you will see later.

Again, this is just the tip of this fetcher. Check out the full page [here](fetching/dynamic.md) for all the details and the complete list of arguments.

### Dynamic anti-protection loading
We also have you covered if you deal with dynamic websites with annoying anti-bot protections!

The `StealthyFetcher` class uses a modified Firefox browser called [Camoufox](https://github.com/daijro/camoufox), bypassing most anti-bot protections by default. Scrapling adds extra layers of flavors and configurations on top to further increase performance and undetectability.
```python
>>> page = StealthyFetcher().fetch('https://www.browserscan.net/bot-detection')  # Running headless by default
>>> page.status == 200
True
>>> page = StealthyFetcher().fetch('https://www.browserscan.net/bot-detection', humanize=True, os_randomize=True)  # and the rest of the arguments...
>>> # The async version of fetch
>>> page = await StealthyFetcher().async_fetch('https://www.browserscan.net/bot-detection')
>>> page.status == 200
True
```
> Note: All requests done by this fetcher wait by default for all JavaScript to be fully loaded and executed. In detail, it waits for the `load` and `domcontentloaded` load states to be reached; you can make it wait for the `networkidle` load state by passing `network_idle=True`, as you will see later.

Again, this is just the tip of this fetcher. Check out the full page [here](fetching/stealthy.md) for all the details and the complete list of arguments.

---

That's Scrapling at a glance. If you want to learn more about it, continue to the next section.
docs/parsing/automatch.md
ADDED

@@ -0,0 +1,220 @@
## Introduction
Auto-matching is one of Scrapling's most powerful features. It allows your scraper to survive website changes by intelligently tracking and relocating elements.

Let's say you are scraping a page with a structure like this:
```html
<div class="container">
    <section class="products">
        <article class="product" id="p1">
            <h3>Product 1</h3>
            <p class="description">Description 1</p>
        </article>
        <article class="product" id="p2">
            <h3>Product 2</h3>
            <p class="description">Description 2</p>
        </article>
    </section>
</div>
```
And you want to scrape the first product, the one with the `p1` ID. You will probably write a selector like this
```python
page.css('#p1')
```
When website owners implement structural changes like
```html
<div class="new-container">
    <div class="product-wrapper">
        <section class="products">
            <article class="product new-class" data-id="p1">
                <div class="product-info">
                    <h3>Product 1</h3>
                    <p class="new-description">Description 1</p>
                </div>
            </article>
            <article class="product new-class" data-id="p2">
                <div class="product-info">
                    <h3>Product 2</h3>
                    <p class="new-description">Description 2</p>
                </div>
            </article>
        </section>
    </div>
</div>
```
The selector will no longer work, and your code needs maintenance. That's where Scrapling's auto-matching feature comes into play.

With Scrapling, you can enable the automatch feature the first time you select an element; then, the next time you select that element and it doesn't exist, Scrapling will use the properties it remembered to search the website for the element with the highest similarity percentage to it, all without AI :)

```python
from scrapling import Adaptor, Fetcher

# Before the change
page = Adaptor(page_source, auto_match=True, url='example.com')
# or
Fetcher.auto_match = True
page = Fetcher.get('https://example.com')
# then
element = page.css('#p1', auto_save=True)
if not element:  # One day the website changes?
    element = page.css('#p1', auto_match=True)  # Scrapling still finds it!
# the rest of your code...
```
Below, I will show you one usage example of this feature. Then, we will dive deep into how to use it and cover its details.

## Real-World Scenario
Let's use a real website as an example and use one of the fetchers to fetch its source. To do this, we would need to find a website that will soon change its design/structure, take a copy of its source, and then wait for the website to make the change. Of course, that's nearly impossible to know in advance unless I know the website's owner, but that would make it a staged test, haha.

To solve this issue, I will use [The Web Archive](https://archive.org/)'s [Wayback Machine](https://web.archive.org/). Here is a copy of [Stack Overflow's website in 2010](https://web.archive.org/web/20100102003420/http://stackoverflow.com/); pretty old, eh?<br/>Let's test if the automatch feature can extract the same button from both the old design from 2010 and the current design, using the same selector :)

If I want to extract the Questions button from the old design, I can use a selector like this: `#hmenus > div:nth-child(1) > ul > li:nth-child(1) > a`. This selector is overly specific because it was generated by Google Chrome.

Now, let's test the same selector in both versions
```python
>>> from scrapling import Fetcher
>>> selector = '#hmenus > div:nth-child(1) > ul > li:nth-child(1) > a'
>>> old_url = "https://web.archive.org/web/20100102003420/http://stackoverflow.com/"
>>> new_url = "https://stackoverflow.com/"
>>> Fetcher.configure(auto_match=True, automatch_domain='stackoverflow.com')
>>>
>>> page = Fetcher.get(old_url, timeout=30)
>>> element1 = page.css_first(selector, auto_save=True)
>>>
>>> # Same selector but used on the updated website
>>> page = Fetcher.get(new_url)
>>> element2 = page.css_first(selector, auto_match=True)
>>>
>>> if element1.text == element2.text:
...    print('Scrapling found the same element in the old and new designs!')
'Scrapling found the same element in the old and new designs!'
```
Note that I used a new argument called `automatch_domain`; this is because, for Scrapling, these are two different domains (`archive.org` and `stackoverflow.com`), so Scrapling will isolate their `auto_match` data. To tell Scrapling they are the same website, we need to pass the custom domain we want to use while saving the auto-match data for both, so Scrapling doesn't isolate them.

In a real-world scenario, the code will be the same, except it will use the same URL for both requests, so you won't need the `automatch_domain` argument. This is the closest example I can give to real-world cases, so I hope it didn't confuse you :)

Hence, in the two examples above, I used both the `Adaptor` class and the `Fetcher` class to show you that the automatch logic is the same.

## How the automatch feature works
Auto-matching works in two phases:

1. **Save Phase**: Store unique properties of elements
2. **Match Phase**: Find elements with similar properties later

Let's say you have an element you got through selection or any other method, and you want the library to find it the next time you scrape this website, even if it had structural/design changes.

With as few technical details as possible, the general logic goes as follows:

1. You tell Scrapling to save that element's unique properties in one of the ways we will show below.
2. Scrapling uses its configured database (SQLite by default) and saves each element's unique properties.
3. Now, because everything about the element can be changed or removed by the website's owner(s), nothing from the element itself can be used as a unique identifier in the database. To solve this issue, I made the storage system rely on two things:
    1. The domain of the current website. If you are using the `Adaptor` class, you should pass it while initializing the class; if you are using one of the fetchers, the domain is taken from the URL automatically.
    2. An `identifier` to query that element's properties from the database. You don't always have to set the identifier yourself, as you will see later when we discuss this.

    Together, they will be used to retrieve the element's unique properties from the database later.

4. Later, when the website's structure changes, you tell Scrapling to auto-match the element. Scrapling retrieves the element's unique properties and matches all elements on the page against them. A score is calculated for their similarity to the element we want. In that comparison, everything is taken into consideration, as you will see later.
5. The element(s) with the highest similarity score to the wanted element are returned.

### The unique properties
You might wonder, if all aspects of an element can be removed or changed, what unique properties are we talking about?

For Scrapling, the unique properties we are relying on are:

- Element tag name, text, attributes (names and values), siblings (tag names only), and path (tag names only).
- Element's parent tag name, attributes (names and values), and text.

But you need to understand that the comparison between elements is not exact; it's more about finding how similar these values are. So everything is considered, even the values' order, like the order in which the element's class names were written before versus the order they are written in now.

## How to use the automatch feature
The automatch feature can be used on any element you have, and it's added as arguments to the CSS/XPath selection methods, as you saw above, but we will get back to that later.

First, you must enable the automatch feature by passing `auto_match=True` to the [Adaptor](main_classes.md#adaptor) class when you initialize it, or enable it in whichever of the available fetchers you are using, as we will show.

Examples:
```python
>>> from scrapling import Adaptor, Fetcher
>>> page = Adaptor(html_doc, auto_match=True)
# OR
>>> Fetcher.auto_match = True
>>> page = Fetcher.get('https://example.com')
```
If you are using the [Adaptor](main_classes.md#adaptor) class, you need to pass the URL of the website you are using with the `url` argument so Scrapling can separate the properties saved for each element by domain.

If you didn't pass a URL, the word `default` will be used in place of the URL field while saving the element's unique properties. So, this will only be an issue if you use the same identifier later for a different website and don't pass the URL parameter while initializing it. The save process will overwrite the previous data, and auto-matching only uses the latest saved properties.
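As a small sketch of that caveat (the identifier name here is just illustrative):
```python
>>> page = Adaptor(html_doc, auto_match=True)  # No `url` passed, so the domain is saved as 'default'
>>> element = page.css('#p1', auto_save=True, identifier='first_product')
>>> # Any other page initialized without a URL that saves under 'first_product' will overwrite this entry
```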

Besides those arguments, we have `storage` and `storage_args`. Both define the class to be used to connect to the database; by default, it's set to the SQLite class the library uses. Those arguments shouldn't matter unless you want to write your own storage system, which we cover on a [separate page in the development section](../development/automatch_storage_system.md).

Now, after enabling the automatch feature globally, you have two main ways to use it.

### The CSS/XPath Selection way
As you have seen in the example above, first, you have to use the `auto_save` argument while selecting an element that exists on the page, like below
```python
element = page.css('#p1', auto_save=True)
```
and when the element doesn't exist anymore, you can use the same selector with the `auto_match` argument, and the library will find it for you
```python
element = page.css('#p1', auto_match=True)
```
Pretty simple, eh?

Well, a lot happened under the hood here. Remember the identifier we mentioned before that you need to set so you can retrieve the element you want? With the `css`/`css_first`/`xpath`/`xpath_first` methods, the identifier is automatically set to the selector you passed, to make things easier :)

That's also why all these methods accept an `identifier` argument so you can set it yourself (there are cases for this), whether you are saving the properties with `auto_save` or matching with `auto_match`.
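For instance, a sketch of pinning a custom identifier (the identifier name is just illustrative):
```python
# Save the element's properties under a custom identifier instead of the selector itself
element = page.css('#p1', auto_save=True, identifier='first_product')
# Later, match it by that same identifier, even if you query with a different selector
element = page.css('.product[data-id="p1"]', auto_match=True, identifier='first_product')
```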

### The manual way
You manually save and retrieve an element, then relocate it; all of this happens within the automatch feature, as shown below. This allows you to automatch any element you have, obtained by any selection method!

First, let's say you got an element like this by text:
```python
>>> element = page.find_by_text('Tipping the Velvet', first_match=True)
```
You can save its unique properties with the `save` method like below, but you must set the identifier yourself. For this example, I chose `my_special_element` as an identifier, but it's best to use a meaningful identifier in your code for the same reason you use meaningful variable names :)
```python
>>> page.save(element, 'my_special_element')
```
Later, when you want to retrieve it and relocate it inside the page with auto-matching, it would go like this
```python
>>> element_dict = page.retrieve('my_special_element')
>>> page.relocate(element_dict, adaptor_type=True)
[<data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>]
>>> page.relocate(element_dict, adaptor_type=True).css('::text')
['Tipping the Velvet']
```
Hence, the `retrieve` and `relocate` methods are used.

If you want to keep it as an `lxml.etree` object, leave out the `adaptor_type` argument
```python
>>> page.relocate(element_dict)
[<Element a at 0x105a2a7b0>]
```

## Troubleshooting

### No Matches Found
```python
# 1. Check if data was saved
element_data = page.retrieve('identifier')
if not element_data:
    print("No data saved for this identifier")

# 2. Try with a different identifier
products = page.css('.product', auto_match=True, identifier='old_selector')

# 3. Save again with a new identifier
products = page.css('.new-product', auto_save=True, identifier='new_identifier')
```

### Wrong Elements Matched
```python
# Use more specific selectors
products = page.css('.product-list .product', auto_save=True)

# Or save with more context
product = page.find_by_text('Product Name').parent
page.save(product, 'specific_product')
```

## Known Issues
In the auto-matching save process, only the unique properties of the first element from the selection results get saved. So if the selector you are using matches different elements in different locations on the page, auto-matching will return only the first element when you relocate it later. This doesn't apply to combined CSS selectors (using commas to combine more than one selector, for example), as these selectors get separated, and each selector gets executed alone.
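A small sketch of that behavior (the selectors here are just illustrative):
```python
# Only the first matched `.product` element's properties get saved
page.css('.product', auto_save=True)
# Combined selectors get separated, so each part is saved and matched on its own
page.css('#p1, #p2', auto_save=True)
```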

## Final thoughts
Explaining this feature in detail without complications turned out to be challenging, but still, if something is left unclear, you can head to the [discussions section](https://github.com/D4Vinci/Scrapling/discussions), and I will reply to you ASAP, or reach out to me privately and have a chat :)
docs/parsing/main_classes.md
ADDED
@@ -0,0 +1,539 @@
## Introduction
After exploring the various ways to select elements with Scrapling and related features, let's take a step back and examine the [Adaptor](#adaptor) class generally, along with the other objects, to better understand the parsing engine.

The [Adaptor](#adaptor) class is the core parsing engine in Scrapling that provides the HTML parsing and element selection capabilities. You can always import it with either of the following imports
```python
from scrapling import Adaptor
from scrapling.parser import Adaptor
```
then use it directly as you already learned on the [overview](../overview.md) page
```python
adaptor = Adaptor(
    text='<html>...</html>',
    url='https://example.com'
)

# Then select elements as you like
elements = adaptor.css('.product')
```
In Scrapling, the main object you deal with after passing an HTML source or fetching a website is, of course, an [Adaptor](#adaptor) object. Any operation you do, like selection, navigation, etc., will return either an [Adaptor](#adaptor) object or an [Adaptors](#adaptors) object, given that the result is element(s) from the page, not text or similar.

In other words, the main page is an [Adaptor](#adaptor) object, the elements within are [Adaptor](#adaptor) objects, and so on. Any text, such as the text content inside elements or inside element attributes, is a [TextHandler](#texthandler) object, and the attributes of each element are stored as an [AttributesHandler](#attributeshandler). We will return to both objects later, so let's focus on the [Adaptor](#adaptor) object.

## Adaptor
### Arguments explained
The most important arguments are `text` and `body`. Both are used to pass the HTML code you want to parse, but the first one accepts `str`, and the latter accepts `bytes`, like how you are used to with `parsel` :)
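A minimal sketch of the two input types (the same page passed in both forms):
```python
>>> page = Adaptor(text='<html><body><p>Hello</p></body></html>')
>>> page = Adaptor(body=b'<html><body><p>Hello</p></body></html>')
```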

Otherwise, you have the arguments `url`, `auto_match`, `storage`, and `storage_args`. All these arguments are settings used with the `auto_match` feature, and they don't make a difference if you are not going to use that feature, so just ignore them for now; we explain them on the [automatch](automatch.md) feature page.

Then you have the arguments for adjusting/manipulating the HTML while the library parses it (see the sketch after this list):

- **encoding**: The encoding that will be used while parsing the HTML. The default is `UTF-8`.
- **keep_comments**: Tells the library whether to keep HTML comments while parsing the page. It's disabled by default, as comments can mess up your scraping in many ways.
- **keep_cdata**: Same logic as the HTML comments. [CDATA](https://stackoverflow.com/questions/7092236/what-is-cdata-in-html) is removed by default for cleaner HTML. This also means that when you check the raw HTML content, you will find it doesn't have the CDATA.
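For instance, a quick sketch of toggling those options (argument names from the list above):
```python
>>> page = Adaptor(html_doc, keep_comments=True)     # HTML comments stay in the parsed tree
>>> page = Adaptor(html_doc, encoding='ISO-8859-1')  # Parse with a different encoding
```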

I have intentionally ignored the arguments `huge_tree` and `root` to avoid making this page more complicated than needed.
You may notice that I do that a lot, and that's because it's something you don't need to know to use the library. The development section will cover these missing parts if you are that interested.

After that, for the main page and the elements within, most properties don't get initialized until you use them, like the text content of a page/element; this lazy evaluation is one of the reasons for Scrapling's speed :)

### Properties
You have already seen much of this on the [overview](../overview.md) page, but don't worry if you haven't. We will review it more thoroughly with more advanced methods/usages. For clarity, the properties for traversal are separated below in the [traversal](#traversal) section.

Let's say we are parsing this HTML page for simplicity:
```html
<html>
  <head>
    <title>Some page</title>
  </head>
  <body>
    <div class="product-list">
      <article class="product" data-id="1">
        <h3>Product 1</h3>
        <p class="description">This is product 1</p>
        <span class="price">$10.99</span>
        <div class="hidden stock">In stock: 5</div>
      </article>

      <article class="product" data-id="2">
        <h3>Product 2</h3>
        <p class="description">This is product 2</p>
        <span class="price">$20.99</span>
        <div class="hidden stock">In stock: 3</div>
      </article>

      <article class="product" data-id="3">
        <h3>Product 3</h3>
        <p class="description">This is product 3</p>
        <span class="price">$15.99</span>
        <div class="hidden stock">Out of stock</div>
      </article>
    </div>

    <script id="page-data" type="application/json">
      {
        "lastUpdated": "2024-09-22T10:30:00Z",
        "totalProducts": 3
      }
    </script>
  </body>
</html>
```
Load the page directly as shown before:
```python
from scrapling import Adaptor
page = Adaptor(html_doc)
```
Get all text content on the page recursively
```python
>>> page.get_all_text()
'Some page\n\n \n\n \nProduct 1\nThis is product 1\n$10.99\nIn stock: 5\nProduct 2\nThis is product 2\n$20.99\nIn stock: 3\nProduct 3\nThis is product 3\n$15.99\nOut of stock'
```
Get the first article as explained before; we will use it as an example
```python
article = page.find('article')
```
With the same logic, get all text content of the element recursively
```python
>>> article.get_all_text()
'Product 1\nThis is product 1\n$10.99\nIn stock: 5'
```
But if you try to get the direct text content, it will be empty; notice the logic difference
```python
>>> article.text
''
```
The `get_all_text` method has the following optional arguments (see the sketch after this list):

1. **separator**: All strings collected will be concatenated using this separator. The default is '\n'.
2. **strip**: If enabled, strings will be stripped before concatenation. Disabled by default.
3. **ignore_tags**: A tuple of all tag names you want to ignore in the final results. The default is `('script', 'style',)`.
4. **valid_values**: If enabled, the method will only collect elements with real values, so all elements with empty text content or only whitespace will be ignored. It's enabled by default.
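A quick sketch of those arguments, reusing the `article` element from above (the output is derived from the earlier run):
```python
>>> article.get_all_text(separator=' | ', strip=True)
'Product 1 | This is product 1 | $10.99 | In stock: 5'
```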

By the way, the text returned here is not a standard string but a [TextHandler](#texthandler); we will get to this in detail later. For now, know that if the text content can be serialized to JSON, you can call `.json()` on it
```python
>>> script = page.find('script')
>>> script.json()
{'lastUpdated': '2024-09-22T10:30:00Z', 'totalProducts': 3}
```
Let's continue and get the element's tag
```python
>>> article.tag
'article'
```
If you use it on the page directly, you will find you are operating on the root `html` element
```python
>>> page.tag
'html'
```
Now, I think I have hammered home the (`page`/`element`) idea, so I won't return to it again.

Getting the attributes of the element
```python
>>> print(article.attrib)
{'class': 'product', 'data-id': '1'}
```
Get the HTML content of the element
```python
>>> article.html_content
'<article class="product" data-id="1"><h3>Product 1</h3>\n <p class="description">This is product 1</p>\n <span class="price">$10.99</span>\n <div class="hidden stock">In stock: 5</div>\n </article>'
```
It's the same if you use the `.body` property
```python
>>> article.body
'<article class="product" data-id="1"><h3>Product 1</h3>\n <p class="description">This is product 1</p>\n <span class="price">$10.99</span>\n <div class="hidden stock">In stock: 5</div>\n </article>'
```
Get the prettified version of the HTML content of the element
```python
>>> print(article.prettify())
<article class="product" data-id="1"><h3>Product 1</h3>
 <p class="description">This is product 1</p>
 <span class="price">$10.99</span>
 <div class="hidden stock">In stock: 5</div>
 </article>
```
To get all the ancestors of this element in the DOM tree
```python
>>> article.path
[<data='<div class="product-list"> <article clas...' parent='<body> <div class="product-list"> <artic...'>,
 <data='<body> <div class="product-list"> <artic...' parent='<html><head><title>Some page</title></he...'>,
 <data='<html><head><title>Some page</title></he...'>]
```
Generate a shortened CSS selector if possible, or generate the full selector
```python
>>> article.generate_css_selector
'body > div > article'
>>> article.generate_full_css_selector
'body > div > article'
```
Same case with XPath
```python
>>> article.generate_xpath_selector
"//body/div/article"
>>> article.generate_full_xpath_selector
"//body/div/article"
```

### Traversal
Using the elements we found above, we will go over the properties/methods for moving around the page in detail.

If you are unfamiliar with the DOM tree or the tree data structure in general, the following traversal part can be confusing. I recommend you look up these concepts online for a better understanding.

If you are too lazy to search for it, here's a quick explanation to give you a good idea.<br/>
Simply put, the `html` element is the root of the website's tree, as every page starts with an `html` element.<br/>
This element sits directly above elements like `head` and `body`. These are considered "children" of the `html` element, and the `html` element is considered their "parent." The `body` element is a "sibling" of the `head` element and vice versa.

Accessing the parent of an element
```python
>>> article.parent
<data='<div class="product-list"> <article clas...' parent='<body> <div class="product-list"> <artic...'>
>>> article.parent.tag
'div'
```
You can chain it as you want, which applies to all similar properties/methods we will review.
```python
>>> article.parent.parent.tag
'body'
```
Get the children of an element
```python
>>> article.children
[<data='<h3>Product 1</h3>' parent='<article class="product" data-id="1"><h3...'>,
 <data='<p class="description">This is product 1...' parent='<article class="product" data-id="1"><h3...'>,
 <data='<span class="price">$10.99</span>' parent='<article class="product" data-id="1"><h3...'>,
 <data='<div class="hidden stock">In stock: 5</d...' parent='<article class="product" data-id="1"><h3...'>]
```
Get all elements underneath an element. It acts as a nested version of the `children` property
```python
>>> article.below_elements
[<data='<h3>Product 1</h3>' parent='<article class="product" data-id="1"><h3...'>,
 <data='<p class="description">This is product 1...' parent='<article class="product" data-id="1"><h3...'>,
 <data='<span class="price">$10.99</span>' parent='<article class="product" data-id="1"><h3...'>,
 <data='<div class="hidden stock">In stock: 5</d...' parent='<article class="product" data-id="1"><h3...'>]
```
This element returns the same result as the `children` property because its children don't have children of their own.

Another example using the element with the `product-list` class will clarify the difference between the `children` property and the `below_elements` property
```python
>>> products_list = page.css_first('.product-list')
>>> products_list.children
[<data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'>,
 <data='<article class="product" data-id="2"><h3...' parent='<div class="product-list"> <article clas...'>,
 <data='<article class="product" data-id="3"><h3...' parent='<div class="product-list"> <article clas...'>]

>>> products_list.below_elements
[<data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'>,
 <data='<h3>Product 1</h3>' parent='<article class="product" data-id="1"><h3...'>,
 <data='<p class="description">This is product 1...' parent='<article class="product" data-id="1"><h3...'>,
 <data='<span class="price">$10.99</span>' parent='<article class="product" data-id="1"><h3...'>,
 <data='<div class="hidden stock">In stock: 5</d...' parent='<article class="product" data-id="1"><h3...'>,
 <data='<article class="product" data-id="2"><h3...' parent='<div class="product-list"> <article clas...'>,
 ...]
```
Get the siblings of an element
```python
>>> article.siblings
[<data='<article class="product" data-id="2"><h3...' parent='<div class="product-list"> <article clas...'>,
 <data='<article class="product" data-id="3"><h3...' parent='<div class="product-list"> <article clas...'>]
```
Get the next element on the same level
```python
>>> article.next
<data='<article class="product" data-id="2"><h3...' parent='<div class="product-list"> <article clas...'>
```
The same logic applies to the `previous` property
```python
>>> article.previous  # It's the first child, so it doesn't have a previous element
>>> second_article = page.css_first('.product[data-id="2"]')
>>> second_article.previous
<data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'>
```
You can check easily and pretty quickly whether an element has a specific class name
```python
>>> article.has_class('product')
True
```
If your case needs more than the element's parent, you can iterate over the whole ancestor tree of any element, like the example below
```python
for ancestor in article.iterancestors():
    print(ancestor.tag)  # do something with it...
```
You can search for a specific ancestor of an element that satisfies a function; all you need to do is pass a function that takes an [Adaptor](#adaptor) object as an argument and returns `True` if the condition is satisfied or `False` otherwise, like below:
```python
>>> article.find_ancestor(lambda ancestor: ancestor.has_class('product-list'))
<data='<div class="product-list"> <article clas...' parent='<body> <div class="product-list"> <artic...'>

>>> article.find_ancestor(lambda ancestor: ancestor.css('.product-list'))  # Same result, different approach
<data='<div class="product-list"> <article clas...' parent='<body> <div class="product-list"> <artic...'>
```
## Adaptors
The `Adaptors` class is the "list" version of the [Adaptor](#adaptor) class. It inherits from the Python standard `List` type, so it shares all `List` properties and methods while adding more methods to make the operations you want to execute on the [Adaptor](#adaptor) instances within more straightforward.

In the [Adaptor](#adaptor) class, all methods/properties that should return a group of elements return them as an [Adaptors](#adaptors) class instance. The only exceptions are when you use the CSS/XPath methods, as follows:

- If you selected a text node with the selector, then the return type will be [TextHandler](#texthandler)/[TextHandlers](#texthandlers). <br/>Examples:
```python
>>> page.css('a::text')  # -> TextHandlers
>>> page.xpath('//a/text()')  # -> TextHandlers
>>> page.css_first('a::text')  # -> TextHandler
>>> page.xpath_first('//a/text()')  # -> TextHandler
>>> page.css('a::attr(href)')  # -> TextHandlers
>>> page.xpath('//a/@href')  # -> TextHandlers
>>> page.css_first('a::attr(href)')  # -> TextHandler
>>> page.xpath_first('//a/@href')  # -> TextHandler
```
- If you used a combined selector that returns mixed types, the result will be a standard Python `List`. <br/>Examples:
```python
>>> page.css('.price_color')  # -> Adaptors
>>> page.css('.product_pod a::attr(href)')  # -> TextHandlers
>>> page.css('.price_color, .product_pod a::attr(href)')  # -> List
```

With that out of the way, let's see what the [Adaptors](#adaptors) class adds to the table.
### Properties
Apart from the normal operations on Python lists like iteration, slicing, etc...

You can do the following:

Execute CSS and XPath selectors directly on the [Adaptor](#adaptor) instances it holds; the arguments and return types are the same as the [Adaptor](#adaptor) class's `css` and `xpath` methods. This, of course, makes chaining methods very straightforward.
```python
>>> page.css('.product_pod a')
[<data='<a href="catalogue/a-light-in-the-attic_...' parent='<div class="image_container"> <a href="c...'>,
 <data='<a href="catalogue/a-light-in-the-attic_...' parent='<h3><a href="catalogue/a-light-in-the-at...'>,
 <data='<a href="catalogue/tipping-the-velvet_99...' parent='<div class="image_container"> <a href="c...'>,
 <data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>,
 <data='<a href="catalogue/soumission_998/index....' parent='<div class="image_container"> <a href="c...'>,
 <data='<a href="catalogue/soumission_998/index....' parent='<h3><a href="catalogue/soumission_998/in...'>,
 ...]

>>> page.css('.product_pod').css('a')  # Returns the same result
[<data='<a href="catalogue/a-light-in-the-attic_...' parent='<div class="image_container"> <a href="c...'>,
 <data='<a href="catalogue/a-light-in-the-attic_...' parent='<h3><a href="catalogue/a-light-in-the-at...'>,
 <data='<a href="catalogue/tipping-the-velvet_99...' parent='<div class="image_container"> <a href="c...'>,
 <data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>,
 <data='<a href="catalogue/soumission_998/index....' parent='<div class="image_container"> <a href="c...'>,
 <data='<a href="catalogue/soumission_998/index....' parent='<h3><a href="catalogue/soumission_998/in...'>,
 ...]
```
Run the `re` and `re_first` methods directly. They take the same arguments as in the [Adaptor](#adaptor) class. I'm leaving these methods to be explained in the [TextHandler](#texthandler) section below.

However, in this class, `re_first` behaves differently: it runs `re` on each [Adaptor](#adaptor) within and returns the first one with a result. The `re` method returns a [TextHandlers](#texthandlers) object as usual, with all the results combined in one [TextHandlers](#texthandlers) instance.
```python
>>> page.css('.price_color').re(r'[\d\.]+')
['51.77',
 '53.74',
 '50.10',
 '47.82',
 '54.23',
 ...]

>>> page.css('.product_pod h3 a::attr(href)').re(r'catalogue/(.*)/index.html')
['a-light-in-the-attic_1000',
 'tipping-the-velvet_999',
 'soumission_998',
 'sharp-objects_997',
 ...]
```
With the `search` method, you can search quickly through the [Adaptor](#adaptor) instances it holds. The function you pass must accept an [Adaptor](#adaptor) instance as the first argument and return True/False. The method will return the first [Adaptor](#adaptor) instance that satisfies the function; otherwise, it returns `None`.
```python
# Find the product with the price '54.23'
>>> search_function = lambda p: float(p.css('.price_color').re_first(r'[\d\.]+')) == 54.23
>>> page.css('.product_pod').search(search_function)
<data='<article class="product_pod"><div class=...' parent='<li class="col-xs-6 col-sm-4 col-md-3 co...'>
```
You can use the `filter` method, too, which takes a function like the `search` method but returns an `Adaptors` instance of all the [Adaptor](#adaptor) instances that satisfy the function
```python
# Find all products with prices over $50
>>> filtering_function = lambda p: float(p.css('.price_color').re_first(r'[\d\.]+')) > 50
>>> page.css('.product_pod').filter(filtering_function)
[<data='<article class="product_pod"><div class=...' parent='<li class="col-xs-6 col-sm-4 col-md-3 co...'>,
 <data='<article class="product_pod"><div class=...' parent='<li class="col-xs-6 col-sm-4 col-md-3 co...'>,
 <data='<article class="product_pod"><div class=...' parent='<li class="col-xs-6 col-sm-4 col-md-3 co...'>,
 ...]
```

## TextHandler
This class is mandatory to understand, as all methods/properties that should return a string will return a `TextHandler`, and the ones that should return a list of strings will return [TextHandlers](#texthandlers) instead.

`TextHandler` is a subclass of the standard Python string, so you can do anything with it that you can do with a string. So, what is the difference that requires a different name?

Of course, `TextHandler` provides extra methods and properties that standard Python strings don't have. We will review them now, but remember that all methods and properties in all classes that return string(s) return `TextHandler`, which opens the door for creativity and makes the code shorter and cleaner, as you will see. Also, you can import it directly and use it on any string, which we will explain later.
### Usage
First, before discussing the added methods, you need to know that all operations on it, like slicing, accessing by index, etc., and methods like `split`, `replace`, `strip`, etc., all return a `TextHandler` again, so you can chain them as you want. If you find a method or property that returns a standard string instead of a `TextHandler`, please open an issue, and we will override it as well.

We start with the `re` and `re_first` methods. These are the same methods that exist in the rest of the classes ([Adaptor](#adaptor), [Adaptors](#adaptors), and [TextHandlers](#texthandlers)), so they take the same arguments as well.

The `re` method takes a string/compiled regex pattern as the first argument. It searches the data for all strings matching the regex and returns them as a [TextHandlers](#texthandlers) instance. The `re_first` method takes the same arguments and behaves similarly, but as you probably figured out from the name, it returns only the first result, as a `TextHandler` instance.

Also, it takes other helpful arguments, which are:

- **replace_entities**: Enabled by default. It replaces character entity references with their corresponding characters.
- **clean_match**: Disabled by default. It makes the method ignore all whitespace and consecutive spaces while matching.
- **case_sensitive**: Enabled by default. As the name implies, disabling it will make the regex ignore letter case while compiling.

You have seen these examples before; the return result is [TextHandlers](#texthandlers) because we used the `re` method.
```python
>>> page.css('.price_color').re(r'[\d\.]+')
['51.77',
 '53.74',
 '50.10',
 '47.82',
 '54.23',
 ...]

>>> page.css('.product_pod h3 a::attr(href)').re(r'catalogue/(.*)/index.html')
['a-light-in-the-attic_1000',
 'tipping-the-velvet_999',
 'soumission_998',
 'sharp-objects_997',
 ...]
```
To explain the other arguments better, we will use a custom string for each example below
```python
>>> from scrapling import TextHandler
>>> test_string = TextHandler('hi  there')  # Hence the two spaces
>>> test_string.re('hi there')
[]
>>> test_string.re('hi there', clean_match=True)  # Using `clean_match` will clean the string before matching the regex
['hi there']

>>> test_string2 = TextHandler('Oh, Hi Mark')
>>> test_string2.re_first('oh, hi Mark')
>>> test_string2.re_first('oh, hi Mark', case_sensitive=False)  # Hence disabling `case_sensitive`
'Oh, Hi Mark'

# Mixing arguments
>>> test_string.re('Hi There', clean_match=True, case_sensitive=False)
['hi there']
```
Another use of the idea of replacing strings with `TextHandler` everywhere is that a property like `html_content` returns a `TextHandler`, so you can run regex on the HTML content if you want:
```python
>>> page.html_content.re('div class=".*">(.*)</div')
['In stock: 5', 'In stock: 3', 'Out of stock']
```

- You also have the `.json()` method, which tries to quickly convert the content to a JSON object if possible; otherwise, it throws an error
```python
>>> page.css_first('#page-data::text')
'\n {\n "lastUpdated": "2024-09-22T10:30:00Z",\n "totalProducts": 3\n }\n '
>>> page.css_first('#page-data::text').json()
{'lastUpdated': '2024-09-22T10:30:00Z', 'totalProducts': 3}
```
Hence, if you didn't specify a text node while selecting the element (like its text content or an attribute's text content), the text content will be selected automatically, like this
```python
>>> page.css_first('#page-data').json()
{'lastUpdated': '2024-09-22T10:30:00Z', 'totalProducts': 3}
```
The [Adaptor](#adaptor) class adds one more thing here, too; let's say this is the page we are working with:
```html
<html>
  <body>
    <div>
      <script id="page-data" type="application/json">
        {
          "lastUpdated": "2024-09-22T10:30:00Z",
          "totalProducts": 3
        }
      </script>
    </div>
  </body>
</html>
```
The [Adaptor](#adaptor) class has the `get_all_text` method, which you should be aware of by now. This method returns a `TextHandler`, of course.<br/><br/>
So, as you know, if you did something like this
```python
>>> page.css_first('div::text').json()
```
You would get an error because the `div` tag doesn't have direct text content that can be serialized to JSON; it actually doesn't have direct text content at all.<br/><br/>
In this case, the `get_all_text` method comes to the rescue, so you can do something like this
```python
>>> page.css_first('div').get_all_text(ignore_tags=[]).json()
{'lastUpdated': '2024-09-22T10:30:00Z', 'totalProducts': 3}
```
I used the `ignore_tags` argument here because its default value is `('script', 'style',)`, as you are aware.<br/><br/>
Another related behavior you should be aware of is the case of using any of the fetchers, which we will explain later. If you have a JSON response like this example:
```python
>>> page = Adaptor("""{"some_key": "some_value"}""")
```
Because the [Adaptor](#adaptor) class is optimized to deal with HTML pages, it will treat it as a broken HTML response and fix it, so if you use the `html_content` property, you get this
```python
>>> page.html_content
'<html><body><p>{"some_key": "some_value"}</p></body></html>'
```
Here, you can use the `json` method directly, and it will work
```python
>>> page.json()
{'some_key': 'some_value'}
```
You might wonder how this happened while the `html` tag lacks direct text.<br/>
Well, for cases like these JSON responses, I made the `.json()` method inside the [Adaptor](#adaptor) class check whether the current element has direct text content; if it doesn't, it uses the `get_all_text` method directly.<br/><br/>It might sound a bit hacky, but remember, Scrapling is currently optimized to work with HTML pages only, so for now, that's the best way to handle JSON responses without sacrificing speed. This will change in upcoming versions.

- Another handy method is `.clean()`; it removes all whitespace and consecutive spaces for you and returns a new `TextHandler`. Wonderful
```python
>>> TextHandler('\n wonderful idea, \reh?').clean()
'wonderful idea, eh?'
```

- Another method that might be helpful in some cases is the `.sort()` method, which sorts the string for you as you would with lists
```python
>>> TextHandler('acb').sort()
'abc'
```
Or do it in reverse:
```python
>>> TextHandler('acb').sort(reverse=True)
'cba'
```

Other methods and properties will be added over time, but remember that this class is returned in place of strings nearly everywhere in the library.

## TextHandlers
You probably guessed it: this class is to [TextHandler](#texthandler) what [Adaptors](#adaptors) is to [Adaptor](#adaptor); it inherits the same logic and methods as standard lists, with only `re` and `re_first` added as new methods.

The only difference is that the `re_first` method logic here runs `re` on each [TextHandler](#texthandler) within and returns the first result it gets, or `None`. Nothing new to explain here, but new methods will be added over time.
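A quick sketch of that difference, reusing the price elements from before (outputs derived from the earlier runs):
```python
>>> prices = page.css('.price_color::text')   # -> TextHandlers
>>> prices.re(r'[\d\.]+')        # All matches from all TextHandlers combined
['51.77', '53.74', '50.10', ...]
>>> prices.re_first(r'[\d\.]+')  # The first result found
'51.77'
```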

## AttributesHandler
This is a read-only version of Python's standard dictionary (`dict`) that's only used to store the attributes of each element, or each [Adaptor](#adaptor) instance in other words.
```python
>>> print(page.find('script').attrib)
{'id': 'page-data', 'type': 'application/json'}
>>> type(page.find('script').attrib).__name__
'AttributesHandler'
```
Because it's read-only, it uses fewer resources than the standard dictionary. Still, it has the same dictionary methods/properties, other than those that allow you to modify/override the data.

It currently adds two extra simple methods:

- The `search_values` method

In standard dictionaries, you can do `dict.get("key_name")` to check if a key exists. However, if you want to search by values instead of keys, that takes you a few extra lines of code. This method does that for you: it allows you to search the current attributes by value and returns a dictionary for each matching item.

A simple example would be
```python
>>> for i in page.find('script').attrib.search_values('page-data'):
        print(i)
{'id': 'page-data'}
```
But this method provides the `partial` argument as well, which allows you to search by part of the value:
```python
>>> for i in page.find('script').attrib.search_values('page', partial=True):
        print(i)
{'id': 'page-data'}
```
These examples won't happen in the real world; more likely, a real-world example would be using it with the `find_all` method to find all elements that have a specific value in their attributes:
```python
>>> page.find_all(lambda element: list(element.attrib.search_values('product')))
[<data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'>,
 <data='<article class="product" data-id="2"><h3...' parent='<div class="product-list"> <article clas...'>,
 <data='<article class="product" data-id="3"><h3...' parent='<div class="product-list"> <article clas...'>]
```
All these elements have 'product' as a value for the `class` attribute.

Hence, I used the `list` function here because `search_values` returns a generator, which would otherwise be truthy for all elements.

- The `json_string` property

This property converts the current attributes to a JSON string if the attributes are JSON serializable; otherwise, it throws an error
```python
>>> page.find('script').attrib.json_string
b'{"id":"page-data","type":"application/json"}'
```
docs/parsing/selection.md
ADDED
@@ -0,0 +1,512 @@
## Introduction
Scrapling currently supports parsing HTML pages exclusively, so it doesn't support XML feeds. This decision was made because the automatch feature won't work with XML, but that might change soon, so stay tuned :)

In Scrapling, there are 5 main ways to find elements:

1. CSS3 Selectors
2. XPath Selectors
3. Finding elements based on filters/conditions
4. Finding elements whose content contains specific text
5. Finding elements whose content matches a specific regex

Of course, there are other indirect ways to find elements with Scrapling, but here we will discuss the main ways in detail. We will also bring up one of the most remarkable features of Scrapling: the ability to find elements that are similar to an element you already have; you can jump to that section directly from [here](#finding-similar-elements).

If you are new to Web Scraping, have little to no experience writing selectors, and want to start quickly, I recommend you jump directly to learning the `find`/`find_all` methods from [here](#filters-based-searching).

## CSS/XPath selectors

### What are CSS selectors?
[CSS](https://en.wikipedia.org/wiki/CSS) is a language for applying styles to HTML documents. It defines selectors to associate those styles with specific HTML elements.

Scrapling implements CSS3 selectors as described in the [W3C specification](http://www.w3.org/TR/2011/REC-css3-selectors-20110929/). CSS selectors support comes from cssselect, so it's best to check which [selectors and pseudo-functions/elements are supported](https://cssselect.readthedocs.io/en/latest/#supported-selectors) in cssselect's documentation.

Also, Scrapling implements some non-standard pseudo-elements:

* To select text nodes, use ``::text``
* To select attribute values, use ``::attr(name)``, where `name` is the name of the attribute whose value you want

In short, if you come from Scrapy/Parsel, you will find the same selector logic here to make things easier. There's no need to learn a logic different from the one most of us are already used to :)

To select elements with CSS selectors, you have the `css` and `css_first` methods. The latter is useful when you are only interested in the first match (or when you know there is only one), while `css` returns all matches as an `Adaptors` list.
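For example, here's the difference at a glance (a minimal sketch, assuming the page has several elements with the class `product`):
```python
product = page.css_first('.product')   # the first matching element only
products = page.css('.product')        # all matching elements as an `Adaptors` list
```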

### What are XPath selectors?
[XPath](https://en.wikipedia.org/wiki/XPath) is a language for selecting nodes in XML documents, which can also be used with HTML. This [cheatsheet](https://devhints.io/xpath) is a good resource for learning XPath. Scrapling adds XPath selectors directly through lxml.

In short, it is the same situation as with CSS selectors; if you come from Scrapy/Parsel, you will find the same selector logic here. BUT Scrapling doesn't implement the XPath extension function `has-class` like Scrapy/Parsel does; instead, there's the `has_class` method that you can use on returned elements for the same purpose.
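For example, to replicate Parsel's `//div[has-class("product")]`, you can select with XPath and then check the class in Python (a minimal sketch):
```python
# Keep only the div elements that have 'product' among their classes
products = [div for div in page.xpath('//div') if div.has_class('product')]
```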

To select elements with XPath selectors, you have the `xpath` and `xpath_first` methods. Again, these methods follow the same logic as the CSS selector methods above.

> Note that each of the `css`, `css_first`, `xpath`, and `xpath_first` methods has additional arguments, but we didn't explain them here as they are all related to the automatch feature. The automatch feature will be described in detail on its own page later.

### Selectors examples
Let's see some shared examples of using CSS and XPath selectors.

Select all elements with the class `product`
```python
products = page.css('.product')
products = page.xpath('//*[@class="product"]')
```
Note: The XPath version won't match if the element has other classes as well; it's better to rely on CSS when selecting by class.
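If you do need to match a class with XPath, the usual workaround is the `contains`/`concat` idiom, which handles elements with multiple classes correctly (shown here for completeness):
```python
products = page.xpath(
    '//*[contains(concat(" ", normalize-space(@class), " "), " product ")]'
)
```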

Select the first element with the class `product`
```python
product = page.css_first('.product')
product = page.xpath_first('//*[@class="product"]')
```
Which would be the same as doing
```python
product = page.css('.product')[0]
product = page.xpath('//*[@class="product"]')[0]
```
Get the text of the first element with the `h1` tag name
```python
title = page.css_first('h1::text')
title = page.xpath_first('//h1//text()')
```
Which is again the same as doing
```python
title = page.css_first('h1').text
title = page.xpath_first('//h1').text
```
Get the `href` attribute of the first element with the `a` tag name
```python
link = page.css_first('a::attr(href)')
link = page.xpath_first('//a/@href')
```
Select the text of the first `h1` element that contains 'Phone' and is under an element with the class 'product'
```python
title = page.css_first('.product h1:contains("Phone")::text')
title = page.xpath_first('//*[@class="product"]//h1[contains(text(),"Phone")]/text()')
```
You can nest and chain selectors as you want, as long as each step returns results
```python
page.css_first('.product').css_first('h1:contains("Phone")::text')
page.xpath_first('//*[@class="product"]').xpath_first('//h1[contains(text(),"Phone")]/text()')
page.xpath_first('//*[@class="product"]').css_first('h1:contains("Phone")::text')
```
Another example:

Get all links that have 'image' in their `href` attribute
```python
links = page.css('a[href*="image"]')
links = page.xpath('//a[contains(@href, "image")]')
for index, link in enumerate(links):
    link_value = link.attrib['href']  # Cleaner than link.css('::attr(href)')
    link_text = link.text
    print(f'Link number {index} points to this url {link_value} with text content as "{link_text}"')
```

## Text-content selection
Scrapling provides the ability to select elements based on their direct text content, and you have two ways to do this:

1. Elements whose direct text content contains the given text, with many options, through the `find_by_text` method.
2. Elements whose direct text content matches the given regex pattern, with many options, through the `find_by_regex` method.

Everything you can do with `find_by_text` can be done with `find_by_regex` if you are good enough with regular expressions (regex), but we provide more options to make these methods accessible to all users.

With `find_by_text`, you pass the text as the first argument; with `find_by_regex`, the regex pattern is the first argument. Both methods share the following arguments:

* **first_match**: If `True` (the default), the method returns the first result it finds.
* **case_sensitive**: If `True`, the case of the letters is considered while matching.
* **clean_match**: If `True`, all whitespaces and consecutive spaces are ignored while matching (see the sketch below).
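For example, `clean_match` can help when the target text is wrapped in messy whitespace in the markup; this is a minimal sketch, assuming such an element exists on the page:
```python
# Would match an element whose raw text is, e.g., '  Tipping\n   the Velvet '
element = page.find_by_text('Tipping the Velvet', clean_match=True)
```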

By default, Scrapling searches for an exact match of the text you pass to `find_by_text`, so the text content of the wanted element has to be ONLY the text you passed. That's why the method also has one extra argument:

* **partial**: If enabled, `find_by_text` returns elements that contain the input text, so it's no longer an exact match.

Note: The method `find_by_regex` can accept both regular strings and a compiled regex pattern as its first argument, as you will see in the upcoming examples.

### Finding Similar Elements
One of the most remarkable new features that Scrapling puts on the table is the ability to tell Scrapling to find elements similar to an element you already have. The inspiration for this feature came from the AutoScraper library, but here it can be used on elements found by any method. Most of its usage will likely follow finding elements through text content, similar to how AutoScraper works, so it's convenient to explain it here.

So, how does it work?

Imagine a scenario where you found a product by its title, for example, and you want to extract the other products listed in the same table/container. With the element you have, you can simply call the `.find_similar()` method on it, and Scrapling will:

1. Find all page elements with the same tree depth as this element.
2. Check all found elements and drop those that don't share the same tag name, parent tag name, and grandparent tag name.
3. Now we are sure (like 99% sure) that these elements are the ones we want, but as a last check, Scrapling uses fuzzy matching to drop the elements whose attributes don't look like the attributes of our element. There's a percentage that controls this step, and I recommend you not play with it unless the default settings don't get you the elements you want.

That's a lot of talking, I know, but I had to go deep. I will give examples of using this method in the next section, but first, these are the arguments that can be passed to it:

* **similarity_threshold**: This is the percentage we discussed in step 3 for comparing elements' attributes. The default value is 0.2. In simpler words, the attributes' values of both elements should be at least 20% similar. If you want to turn off this check (step 3, basically), you can set this argument to 0, but I recommend you read what the other arguments do first.
* **ignore_attributes**: The attribute names passed will be ignored while matching attributes in the last step. The default value is `('href', 'src',)` because URLs can change a lot between elements, making them unreliable.
* **match_text**: If `True`, the element's text content will be considered when matching. Using this in normal cases is not recommended, but it depends (see the sketch below).
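As a quick sketch of how these arguments combine (hypothetical usage, not tied to a specific page):
```python
# Skip the attribute-similarity check entirely and consider text content instead
similar_elements = element.find_similar(similarity_threshold=0, match_text=True)
```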

Now, let's check out the examples below.

### Examples
Let's see some shared examples of finding elements with raw text and regex.

I will use the `Fetcher` here to clarify these examples, but it will be explained in detail later.
```python
from scrapling.fetchers import Fetcher
page = Fetcher.get('https://books.toscrape.com/index.html')
```
Find the first element whose text fully matches this text
```python
>>> page.find_by_text('Tipping the Velvet')
<data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>
```
Combine it with `page.urljoin` to build the full URL from the relative `href`
```python
>>> page.find_by_text('Tipping the Velvet').attrib['href']
'catalogue/tipping-the-velvet_999/index.html'
>>> page.urljoin(page.find_by_text('Tipping the Velvet').attrib['href'])
'https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html'
```
Get all matches if there are more (hence, it returns a list)
```python
>>> page.find_by_text('Tipping the Velvet', first_match=False)
[<data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>]
```
Get all elements that contain the word `the` (partial matching)
```python
>>> results = page.find_by_text('the', partial=True, first_match=False)
>>> [i.text for i in results]
['A Light in the ...',
 'Tipping the Velvet',
 'The Requiem Red',
 'The Dirty Little Secrets ...',
 'The Coming Woman: A ...',
 'The Boys in the ...',
 'The Black Maria',
 'Mesaerion: The Best Science ...',
 "It's Only the Himalayas"]
```
The search is case-insensitive by default, so those results include `The` as well, not only the lowercase `the`; let's limit the search to elements containing the lowercase `the` only.
```python
>>> results = page.find_by_text('the', partial=True, first_match=False, case_sensitive=True)
>>> [i.text for i in results]
['A Light in the ...',
 'Tipping the Velvet',
 'The Boys in the ...',
 "It's Only the Himalayas"]
```
Get the first element whose text content matches my price regex
```python
>>> page.find_by_regex(r'£[\d\.]+')
<data='<p class="price_color">£51.77</p>' parent='<div class="product_price"> <p class="pr...'>
>>> page.find_by_regex(r'£[\d\.]+').text
'£51.77'
```
It's the same if you pass a compiled regex; Scrapling detects the input type and acts upon that:
```python
>>> import re
>>> regex = re.compile(r'£[\d\.]+')
>>> page.find_by_regex(regex)
<data='<p class="price_color">£51.77</p>' parent='<div class="product_price"> <p class="pr...'>
>>> page.find_by_regex(regex).text
'£51.77'
```
Get all elements that match the regex
```python
>>> page.find_by_regex(r'£[\d\.]+', first_match=False)
[<data='<p class="price_color">£51.77</p>' parent='<div class="product_price"> <p class="pr...'>,
 <data='<p class="price_color">£53.74</p>' parent='<div class="product_price"> <p class="pr...'>,
 <data='<p class="price_color">£50.10</p>' parent='<div class="product_price"> <p class="pr...'>,
 <data='<p class="price_color">£47.82</p>' parent='<div class="product_price"> <p class="pr...'>,
 ...]
```
And so on...

Find all elements similar to the current element in location and attributes. For our case, ignore the 'title' attribute while matching
```python
>>> element = page.find_by_text('Tipping the Velvet')
>>> element.find_similar(ignore_attributes=['title'])
[<data='<a href="catalogue/a-light-in-the-attic_...' parent='<h3><a href="catalogue/a-light-in-the-at...'>,
 <data='<a href="catalogue/soumission_998/index....' parent='<h3><a href="catalogue/soumission_998/in...'>,
 <data='<a href="catalogue/sharp-objects_997/ind...' parent='<h3><a href="catalogue/sharp-objects_997...'>,
 ...]
```
Notice that the number of elements is 19, not 20, because the current element is not included in the results.
```python
>>> len(element.find_similar(ignore_attributes=['title']))
19
```
Get the `href` attribute from all similar elements
```python
>>> [
...     element.attrib['href']
...     for element in element.find_similar(ignore_attributes=['title'])
... ]
['catalogue/a-light-in-the-attic_1000/index.html',
 'catalogue/soumission_998/index.html',
 'catalogue/sharp-objects_997/index.html',
 ...]
```
To increase the complexity a little bit, let's say we want to get all books' data using that element as a starting point for some reason
```python
>>> for product in element.parent.parent.find_similar():
...     print({
...         "name": product.css_first('h3 a::text'),
...         "price": product.css_first('.price_color').re_first(r'[\d\.]+'),
...         "stock": product.css('.availability::text')[-1].clean()
...     })
{'name': 'A Light in the ...', 'price': '51.77', 'stock': 'In stock'}
{'name': 'Soumission', 'price': '50.10', 'stock': 'In stock'}
{'name': 'Sharp Objects', 'price': '47.82', 'stock': 'In stock'}
...
```
### Advanced examples
Here are some more advanced, real-world examples of using the `find_similar` method.

E-commerce Product Extraction
```python
def extract_product_grid(page):
    # Find the first product card
    first_product = page.find_by_text('Add to Cart').find_ancestor(
        lambda e: e.has_class('product-card')
    )

    # Find similar product cards
    products = first_product.find_similar()

    return [
        {
            'name': p.css_first('h3::text'),
            'price': p.css_first('.price::text').re_first(r'\d+\.\d{2}'),
            'stock': 'In stock' in p.text,
            'rating': p.css_first('.rating').attrib.get('data-rating')
        }
        for p in products
    ]
```
Table Row Extraction
```python
def extract_table_data(page):
    # Find the first data row
    first_row = page.css_first('table tbody tr')

    # Find similar rows
    rows = first_row.find_similar()

    return [
        {
            'column1': row.css_first('td:nth-child(1)::text'),
            'column2': row.css_first('td:nth-child(2)::text'),
            'column3': row.css_first('td:nth-child(3)::text')
        }
        for row in rows
    ]
```
Form Field Extraction
```python
def extract_form_fields(page):
    # Find the first form field container
    first_field = page.css_first('input').find_ancestor(
        lambda e: e.has_class('form-field')
    )

    # Find similar field containers
    fields = first_field.find_similar()

    return [
        {
            'label': f.css_first('label::text'),
            'type': f.css_first('input').attrib.get('type'),
            'required': 'required' in f.css_first('input').attrib
        }
        for f in fields
    ]
```
Extracting reviews from a website
```python
def extract_reviews(page):
    # Find the first review
    first_review = page.find_by_text('Great product!')
    review_container = first_review.find_ancestor(
        lambda e: e.has_class('review')
    )

    # Find similar reviews
    all_reviews = review_container.find_similar()

    return [
        {
            'text': r.css_first('.review-text::text'),
            'rating': r.attrib.get('data-rating'),
            'author': r.css_first('.reviewer::text')
        }
        for r in all_reviews
    ]
```
## Filters-based searching
This search method is arguably the best way to find elements in Scrapling because it is powerful and easier for newcomers to Web Scraping to learn than writing selectors.

Inspired by BeautifulSoup's `find_all` function, you can find elements using the `find_all` and `find` methods. Both methods can take multiple types of filters and return all elements on the page that satisfy all of them.

To be more specific:

* Any string passed is considered a tag name.
* Any iterable passed, like a list/tuple/set, is considered an iterable of tag names.
* Any dictionary passed is considered a mapping of HTML element attribute names and attribute values.
* Any regex patterns passed are used to filter elements by content, like the `find_by_regex` method.
* Any functions passed are used to filter elements.
* Any keyword argument passed is considered an HTML element attribute with its value.

It collects all passed arguments and keywords, and each filter passes its results to the following filter in a waterfall-like filtering system.

It filters all elements in the current page/element in the following order:

1. All elements with the passed tag name(s) get collected.
2. All elements that match all passed attribute(s) get collected; if a previous filter was used, the previously collected elements get filtered.
3. All elements that match all passed regex patterns get collected; if previous filter(s) were used, the previously collected elements get filtered.
4. All elements that fulfill all passed function(s) get collected; if previous filter(s) were used, the previously collected elements get filtered.

Notes:

1. As you probably understood, the filtering process always starts from the first filter it finds in the filtering order above. So, if no tag name(s) are passed but attributes are, the process starts from that layer, and so on.
2. The order in which you pass the arguments doesn't matter. The only order taken into consideration is the one explained above (see the sketch right after these notes).
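For instance, these two calls give the same result because only the filtering order above matters, not the order of the arguments (a minimal sketch):
```python
page.find_all('div', {'class': 'quote'})
page.find_all({'class': 'quote'}, 'div')  # same result, arguments reordered
```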

Check the examples below to clear up any confusion :)

### Examples
```python
>>> from scrapling.fetchers import Fetcher
>>> page = Fetcher.get('https://quotes.toscrape.com/')
```
Find all elements with the tag name `div`.
```python
>>> page.find_all('div')
[<data='<div class="container"> <div class="row...' parent='<body> <div class="container"> <div clas...'>,
 <data='<div class="row header-box"> <div class=...' parent='<div class="container"> <div class="row...'>,
 ...]
```
Find all div elements with a class that equals `quote`.
```python
>>> page.find_all('div', class_='quote')
[<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
 <data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
 ...]
```
Same as above.
```python
>>> page.find_all('div', {'class': 'quote'})
[<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
 <data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
 ...]
```
Find all elements with a class that equals `quote`.
```python
>>> page.find_all({'class': 'quote'})
[<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
 <data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
 ...]
```
Find all div elements with a class that equals `quote` and whose `.text` child element contains the word 'world'.
```python
>>> page.find_all('div', {'class': 'quote'}, lambda e: "world" in e.css_first('.text::text'))
[<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>]
```
Find all elements that have children.
```python
>>> page.find_all(lambda element: len(element.children) > 0)
[<data='<html lang="en"><head><meta charset="UTF...'>,
 <data='<head><meta charset="UTF-8"><title>Quote...' parent='<html lang="en"><head><meta charset="UTF...'>,
 <data='<body> <div class="container"> <div clas...' parent='<html lang="en"><head><meta charset="UTF...'>,
 ...]
```
Find all elements that contain the word 'world' in their content.
```python
>>> page.find_all(lambda element: "world" in element.text)
[<data='<span class="text" itemprop="text">“The...' parent='<div class="quote" itemscope itemtype="h...'>,
 <data='<a class="tag" href="/tag/world/page/1/"...' parent='<div class="tags"> Tags: <meta class="ke...'>]
```
Find all span elements that match the given regex.
```python
>>> page.find_all('span', re.compile(r'world'))
[<data='<span class="text" itemprop="text">“The...' parent='<div class="quote" itemscope itemtype="h...'>]
```
Find all div and span elements with the class 'quote' (there are no such span elements, so only div elements are returned).
```python
>>> page.find_all(['div', 'span'], {'class': 'quote'})
[<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
 <data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>,
 ...]
```
Mix things up
```python
>>> page.find_all({'itemtype':"http://schema.org/CreativeWork"}, 'div').css('.author::text')
['Albert Einstein',
 'J.K. Rowling',
 ...]
```
A bonus pro tip: Find all elements whose `href` attribute's value ends with the word 'Einstein'.
```python
>>> page.find_all({'href$': 'Einstein'})
[<data='<a href="/author/Albert-Einstein">(about...' parent='<span>by <small class="author" itemprop=...'>,
 <data='<a href="/author/Albert-Einstein">(about...' parent='<span>by <small class="author" itemprop=...'>,
 <data='<a href="/author/Albert-Einstein">(about...' parent='<span>by <small class="author" itemprop=...'>]
```
Another pro tip: Find all elements whose `href` attribute's value contains '/author/'.
```python
>>> page.find_all({'href*': '/author/'})
[<data='<a href="/author/Albert-Einstein">(about...' parent='<span>by <small class="author" itemprop=...'>,
 <data='<a href="/author/J-K-Rowling">(about)</a...' parent='<span>by <small class="author" itemprop=...'>,
 <data='<a href="/author/Albert-Einstein">(about...' parent='<span>by <small class="author" itemprop=...'>,
 ...]
```
And so on...

## Generating selectors
You can always generate CSS/XPath selectors for any element, reusable here or anywhere else, and the most remarkable thing is that it doesn't matter which method you used to find that element!

Generate a short CSS selector for the `url_element` element (Scrapling creates a short one if possible; otherwise, it falls back to a full selector)
```python
>>> url_element = page.find({'href*': '/author/'})
>>> url_element.generate_css_selector
'body > div > div:nth-of-type(2) > div > div > span:nth-of-type(2) > a'
```
Generate a full CSS selector for the `url_element` element from the start of the page
```python
>>> url_element.generate_full_css_selector
'body > div > div:nth-of-type(2) > div > div > span:nth-of-type(2) > a'
```
Generate a short XPath selector for the `url_element` element (again, a short one if possible; otherwise, a full selector)
```python
>>> url_element.generate_xpath_selector
'//body/div/div[2]/div/div/span[2]/a'
```
Generate a full XPath selector for the `url_element` element from the start of the page
```python
>>> url_element.generate_full_xpath_selector
'//body/div/div[2]/div/div/span[2]/a'
```
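One handy pattern is to store a generated selector and feed it back into `css`/`xpath` later, e.g., after re-fetching the page (a minimal sketch):
```python
selector = url_element.generate_css_selector
matches = page.css(selector)  # re-select the same element(s) with the generated selector
```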
> Note: <br>
> When you tell Scrapling to create a short selector, it tries to find a unique element to use as a stop point in the generation, like an element with an `id` attribute. In our case, there wasn't any, which is why the short and the full selectors are the same.

## Using selectors with regular expressions
Like in `parsel`/`scrapy`, you have the `re` and `re_first` methods for extracting data using regular expressions. However, unlike those libraries, these methods exist on nearly all classes, such as `Adaptor`/`Adaptors`/`TextHandler` and `TextHandlers`, which means you can use them directly on an element even if you didn't select a text node.

We will have a deeper look at this while explaining the [TextHandler](main_classes.md#texthandler) class, but in general, it works like the examples below (back on the books.toscrape.com page from earlier):
```python
>>> page.css_first('.price_color').re_first(r'[\d\.]+')
'51.77'

>>> page.css('.price_color').re_first(r'[\d\.]+')
'51.77'

>>> page.css('.price_color').re(r'[\d\.]+')
['51.77',
 '53.74',
 '50.10',
 '47.82',
 '54.23',
 ...]

>>> page.css('.product_pod h3 a::attr(href)').re(r'catalogue/(.*)/index.html')
['a-light-in-the-attic_1000',
 'tipping-the-velvet_999',
 'soumission_998',
 'sharp-objects_997',
 ...]

>>> filtering_function = lambda e: e.parent.tag == 'h3' and e.parent.parent.has_class('product_pod')  # Same target as the selector above
>>> page.find('a', filtering_function).attrib['href'].re(r'catalogue/(.*)/index.html')
['a-light-in-the-attic_1000']

>>> page.find_by_text('Tipping the Velvet').attrib['href'].re(r'catalogue/(.*)/index.html')
['tipping-the-velvet_999']
```
And so on. You get the idea. We will explain this in more detail on the next page while covering the [TextHandler](main_classes.md#texthandler) class.
docs/stylesheets/extra.css
ADDED
@@ -0,0 +1,3 @@
.md-grid {
  max-width: 65%;
}
docs/tutorials/migrating_from_beautifulsoup.md
ADDED
@@ -0,0 +1,98 @@
# Migrating from BeautifulSoup to Scrapling

<style>
.md-grid {
  max-width: 85%;
}
</style>

If you're already familiar with BeautifulSoup, you're in for a treat. Scrapling is faster, provides similar parsing capabilities, and adds powerful new features for fetching and handling modern web pages. This guide will help you quickly adapt your existing BeautifulSoup code to take advantage of Scrapling's capabilities.

Below is a table that covers the most common operations you'll perform when scraping web pages. Each row shows how to accomplish a specific task in BeautifulSoup and the corresponding way to do it in Scrapling (one row is also shown in action right after the table).

You will notice that some BeautifulSoup shortcuts are missing in Scrapling, but that's one of the reasons BeautifulSoup is slower than Scrapling. The point is: if the same feature can be used in a short one-liner, there is no need to sacrifice performance just to make that short line shorter :)

| Task | BeautifulSoup Code | Scrapling Code |
|------|--------------------|----------------|
| Parser import | `from bs4 import BeautifulSoup` | `from scrapling.parser import Adaptor` |
| Parsing HTML from string | `soup = BeautifulSoup(html, 'html.parser')` | `page = Adaptor(html)` |
| Finding a single element | `element = soup.find('div', class_='example')` | `element = page.find('div', class_='example')` |
| Finding multiple elements | `elements = soup.find_all('div', class_='example')` | `elements = page.find_all('div', class_='example')` |
| Finding a single element (Example 2) | `element = soup.find('div', attrs={"class": "example"})` | `element = page.find('div', {"class": "example"})` |
| Finding a single element (Example 3) | `element = soup.find(re.compile("^b"))` | `element = page.find(re.compile("^b"))`<br/>`element = page.find_by_regex(r"^b")` |
| Finding a single element (Example 4) | `element = soup.find(lambda e: len(list(e.children)) > 0)` | `element = page.find(lambda e: len(e.children) > 0)` |
| Finding a single element (Example 5) | `element = soup.find(["a", "b"])` | `element = page.find(["a", "b"])` |
| Finding an element by its text content | `element = soup.find(text="some text")` | `element = page.find_by_text("some text", partial=False)` |
| Using CSS selectors to find the first matching element | `element = soup.select_one('div.example')` | `element = page.css_first('div.example')` |
| Using CSS selectors to find all matching elements | `elements = soup.select('div.example')` | `elements = page.css('div.example')` |
| Get a prettified version of the page/element source | `prettified = soup.prettify()` | `prettified = page.prettify()` |
| Get a non-prettified version of the page/element source | `source = str(soup)` | `source = page.body` |
| Get the tag name of an element | `name = element.name` | `name = element.tag` |
| Extracting the text content of an element | `string = element.string` | `string = element.text` |
| Extracting all the text in a document or beneath a tag | `text = soup.get_text(strip=True)` | `text = page.get_all_text(strip=True)` |
| Access the dictionary of attributes | `attrs = element.attrs` | `attrs = element.attrib` |
| Extracting attributes | `attr = element['href']` | `attr = element.attrib['href']` |
| Navigating to the parent | `parent = element.parent` | `parent = element.parent` |
| Get all parents of an element | `parents = list(element.parents)` | `parents = list(element.iterancestors())` |
| Searching for an element in the parents of an element | `target_parent = element.find_parent("a")` | `target_parent = element.find_ancestor(lambda p: p.tag == 'a')` |
| Get all siblings of an element | N/A | `siblings = element.siblings` |
| Get the next sibling of an element | `next_element = element.next_sibling` | `next_element = element.next` |
| Searching for an element in the siblings of an element | `target_sibling = element.find_next_sibling("a")`<br/>`target_sibling = element.find_previous_sibling("a")` | `target_sibling = element.siblings.search(lambda s: s.tag == 'a')` |
| Searching for elements in the siblings of an element | `target_siblings = element.find_next_siblings("a")`<br/>`target_siblings = element.find_previous_siblings("a")` | `target_siblings = element.siblings.filter(lambda s: s.tag == 'a')` |
| Searching for an element in the next elements of an element | `target_element = element.find_next("a")` | `target_element = element.below_elements.search(lambda p: p.tag == 'a')` |
| Searching for elements in the next elements of an element | `target_elements = element.find_all_next("a")` | `target_elements = element.below_elements.filter(lambda p: p.tag == 'a')` |
| Searching for an element in the previous elements of an element | `target_element = element.find_previous("a")` | `target_element = element.path.search(lambda p: p.tag == 'a')` |
| Searching for elements in the previous elements of an element | `target_elements = element.find_all_previous("a")` | `target_elements = element.path.filter(lambda p: p.tag == 'a')` |
| Get the previous sibling of an element | `prev_element = element.previous_sibling` | `prev_element = element.previous` |
| Navigating to children | `children = list(element.children)` | `children = element.children` |
| Get all descendants of an element | `descendants = list(element.descendants)` | `descendants = element.below_elements` |
| Filtering a group of elements that satisfy a condition | `group = soup.find('p', 'story').css.filter('a')` | `group = page.find_all('p', 'story').filter(lambda p: p.tag == 'a')` |
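To make the table concrete, here's one row in action side by side (a minimal sketch, assuming `element` is an anchor nested inside a `<p>` tag):
```python
# BeautifulSoup
parent = element.find_parent('p')

# Scrapling
parent = element.find_ancestor(lambda ancestor: ancestor.tag == 'p')
```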

One point to remember: BeautifulSoup provides features for modifying and manipulating the page after parsing it. Scrapling focuses on scraping the page faster for you, and then you can do whatever you want with the extracted information. So, both tools can be used in Web Scraping, but only one of them specializes in it :)

### Putting It All Together

Here's a simple example of scraping a web page to extract all the links using BeautifulSoup and Scrapling.

**With BeautifulSoup:**

```python
import requests
from bs4 import BeautifulSoup

url = 'http://example.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

links = soup.find_all('a')
for link in links:
    print(link['href'])
```

**With Scrapling:**

```python
from scrapling import Fetcher

url = 'http://example.com'
page = Fetcher.get(url=url)

links = page.css('a::attr(href)')
for link in links:
    print(link)
```

As you can see, Scrapling simplifies the process by handling the fetching and parsing in a single step, making your code cleaner and more efficient.

**Additional Notes:**

- **Different parsers**: BeautifulSoup allows you to choose the parser engine to use, one of which is `lxml`. Scrapling doesn't do that and always uses the `lxml` library for performance reasons.
- **Element Types**: In BeautifulSoup, elements are `Tag` objects, while in Scrapling, they are `Adaptor` objects. However, they provide similar methods and properties for navigation and data extraction.
- **Error Handling**: Both libraries return `None` when an element is not found (e.g., `soup.find()` or `page.css_first()`). To avoid errors, check for `None` before accessing properties (see the sketch after these notes).
- **Text Extraction**: Scrapling provides additional methods for handling text through `TextHandler`, such as `clean()`, which can be helpful for removing extra whitespace or unwanted characters. Please check the documentation for the complete list.
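For instance, here's a quick sketch illustrating the last two notes, reusing the `page` object from the example above:
```python
# Error handling: css_first returns None when nothing matches
element = page.css_first('div.missing')
if element is not None:
    print(element.text)

# Text extraction: the returned text is a TextHandler, so methods like clean() are available
title = page.css_first('a::text')
if title:
    print(title.clean())  # whitespace-normalized text
```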

The documentation provides more details on Scrapling's features and the full list of arguments that can be passed to all methods.

This guide should make your transition from BeautifulSoup to Scrapling smooth and straightforward. Happy scraping!
docs/tutorials/replacing_ai.md
ADDED
@@ -0,0 +1 @@
WIP
mkdocs.yml
ADDED
@@ -0,0 +1,142 @@
site_name: Scrapling
site_description: Scrapling - a Python library to make Web Scraping easy again!
site_author: Karim Shoair
repo_url: https://github.com/D4Vinci/Scrapling
repo_name: D4Vinci/Scrapling
copyright: Copyright © 2025 Karim Shoair

theme:
  name: material
  language: en
  palette:
    - media: "(prefers-color-scheme)"
      toggle:
        icon: material/link
        name: Switch to light mode
    - media: "(prefers-color-scheme: light)"
      scheme: default
      primary: indigo
      accent: indigo
      toggle:
        icon: material/toggle-switch
        name: Switch to dark mode
    - media: "(prefers-color-scheme: dark)"
      scheme: slate
      primary: black
      accent: indigo
      toggle:
        icon: material/toggle-switch-off
        name: Switch to system preference
  font:
    text: Roboto
    code: Roboto Mono
  icon:
    repo: fontawesome/brands/github-alt
  features:
    - announce.dismiss
    - navigation.top
    - navigation.footer
    - navigation.instant
    - navigation.indexes
    - navigation.sections
    - navigation.tracking
    - navigation.instant.progress
    # - navigation.tabs
    # - navigation.expand
    # - toc.integrate
    - search.share
    - search.suggest
    - search.highlight
    - content.tabs.link
    - content.width.full
    - content.action.view
    - content.action.edit
    - content.code.copy
    - content.code.annotate
    - content.code.annotation
  # logo: assets/logo.png
  # favicon: assets/favicon.png

nav:
  - Introduction: index.md
  - Overview: overview.md
  - Parsing Performance: benchmarks.md
  - User Guide:
      - Parsing:
          - Querying elements: parsing/selection.md
          - Main classes: parsing/main_classes.md
          - Using automatch feature: parsing/automatch.md
      - Fetching:
          - Choosing a fetcher: fetching/choosing.md
          - Static requests: fetching/static.md
          - Dynamically loaded websites: fetching/dynamic.md
          - Fully bypass protections while fetching: fetching/stealthy.md
  - Tutorials:
      - Using Scrapling instead of AI: tutorials/replacing_ai.md
      - Migrating from BeautifulSoup: tutorials/migrating_from_beautifulsoup.md
      # - Migrating from AutoScraper: tutorials/migrating_from_autoscraper.md
  - Development:
      - API Reference:
          - Adaptor: api-reference/adaptor.md
          - Fetchers: api-reference/fetchers.md
          - Custom Types: api-reference/custom-types.md
      - Writing your retrieval system: development/automatch_storage_system.md
      - Using Scrapling's custom types: development/scrapling_custom_types.md
  - Support and Sponsors: donate.md
  - Contributing: contributing.md
  - Changelog: 'https://github.com/D4Vinci/Scrapling/releases'

markdown_extensions:
  - admonition
  - abbr
  # - mkautodoc
  - pymdownx.emoji
  - pymdownx.details
  - pymdownx.superfences
  - pymdownx.highlight:
      anchor_linenums: true
  - pymdownx.inlinehilite
  - pymdownx.snippets
  - pymdownx.tabbed:
      alternate_style: true
  - tables
  - codehilite:
      css_class: highlight
  - toc:
      permalink: true

plugins:
  - search
  - mkdocstrings:
      handlers:
        python:
          paths: [scrapling]
          options:
            docstring_style: sphinx
            show_source: true
            show_root_heading: true
            show_if_no_docstring: true
            inherited_members: true
            members_order: source
            separate_signature: true
            unwrap_annotated: true
            filters:
              - '!^_'
            merge_init_into_class: true
            docstring_section_style: spacy
            signature_crossrefs: true
            show_symbol_type_heading: true
            show_symbol_type_toc: true

extra:
  social:
    - icon: fontawesome/brands/github
      link: https://github.com/D4Vinci/Scrapling
    - icon: fontawesome/brands/python
      link: https://pypi.org/project/scrapling/
    - icon: fontawesome/brands/x-twitter
      link: https://x.com/D4Vinci1

extra_css:
  - stylesheets/extra.css