Commit aaef24a (parent: ff3a291), committed by Greg Wilson

feat: overhaul for relaunch


- Create `requirements.txt` with pinned dependencies for building site.
- Update .gitignore.
- Enhance Makefile.
- Rename existing lessons to `lesson/dd_specific.py` (two-digit, lower-case prefixes).
  - Notebooks with other names are not included in the index page.
- Add SQL tutorial.
- Add scripts to regenerate SQLite databases.
- Add queueing theory tutorial.
- Rename `scripts` directory to `bin`.
- Replace old `build.py` with:
  - `bin/extract.py`: get metadata from `*/index.md` lesson pages.
  - `bin/build.py`: build the root home page and lesson home pages.
  - `bin/check_empty_cells.py`: look for empty cells in notebooks (enhanced).
  - `bin/check_missing_titles.py`: look for notebooks without an H1 title.
  - `bin/check_notebook_packages.py`: check consistency of package versions within a lesson.
- Add `make check_packages NOTEBOOKS="*/??_*.py"` to check package consistency within a lesson.
  - If `NOTEBOOKS` is not specified, all notebooks are checked.
- Add `make check_exec NOTEBOOKS="*/??_*.py"` to check notebook execution.
  - If `NOTEBOOKS` is not specified, all notebooks are executed (slow).
- Fix missing package imports in notebook headers.
- Pin package versions in notebook headers.
- Make content of lesson home pages uniform.
- Update GitHub workflows to launch commands from the Makefile.
  - Requires using `uv` in workflows.
- Extract and modify CSS.
- Put SVG icons in includable files in `templates/icons/*.svg`.
- Make titles of notebooks more uniform.
- Build `pages/*.md` using `templates/page.html`.
- Add link checker.
  - Requires a local server to be running, and takes 10 minutes or more to execute.
- Fix multiple bugs in individual lessons.
  - Most were introduced by package version pinning.
  - See notes below for outstanding issues.

Note: build the [`marimo_learn`](https://github.com/gvwilson/marimo_learn) package
with utilities to localize SQLite database files.

Add `disabled=True` to prevent execution of deliberately buggy cells in script mode (?).

The code at lines 497–499 calls `lz.sink_csv(..., lazy=True)`. The
`lazy=True` argument was added so the call would return a lazy sink that
could be passed to `pl.collect_all()` for parallel execution, rather than
writing the file immediately. However, in polars 1.24.0 the `lazy`
parameter was removed from `sink_csv()` (and likely from `sink_parquet()`
and `sink_ndjson()` too), and the API for collecting multiple sinks in
parallel has changed.

These notebooks use the `hf://` protocol to stream a Parquet file
directly from Hugging Face:

```
URL = f"hf://datasets/{repo_id}@{branch}/{file_path}"
```

Polars URL-encodes the slash in the repo name when it calls the HF
API, which then rejects it as an invalid repo name. The fix is to
download the file and store it locally, or make it available in some
other location.
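A minimal sketch of the download-and-localize route, using only the standard library. The `resolve` URL layout is an assumption about Hugging Face's hosting (the `hf://` path maps to a direct `…/resolve/…` URL), and `repo_id`/`branch`/`file_path` are the variables already used in the notebook:

```python
from pathlib import Path
from urllib.request import urlretrieve


def resolve_url(repo_id: str, branch: str, file_path: str) -> str:
    # hf://datasets/{repo_id}@{branch}/{file_path} maps to this direct URL,
    # which keeps the slash in the repo name intact.
    return f"https://huggingface.co/datasets/{repo_id}/resolve/{branch}/{file_path}"


def localize(repo_id: str, branch: str, file_path: str, dest: str) -> Path:
    # Download once and cache locally; polars can then read the local copy
    # with pl.read_parquet(path) instead of streaming over hf://.
    path = Path(dest)
    if not path.exists():
        urlretrieve(resolve_url(repo_id, branch, file_path), path)
    return path
```

The notebook cell would then call `pl.read_parquet(localize(repo_id, branch, file_path, "data.parquet"))` instead of passing the `hf://` URL.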

Kagglehub requires Kaggle API credentials, which are not available in the
browser. Either remove the data-loading step or substitute a bundled
sample dataset.
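If the bundled-sample route is taken, the data-loading cell could fall back like this (a sketch only; the column names and values are purely illustrative, not the notebook's actual dataset):

```python
import pandas as pd


def load_dataset():
    # kagglehub.dataset_download() needs Kaggle API credentials, which are
    # unavailable in the browser, so ship a small representative sample
    # instead of fetching the full dataset.
    return pd.DataFrame({
        "passenger_id": [1, 2, 3],
        "fare": [7.25, 71.28, 8.05],
        "survived": [0, 1, 1],
    })


df = load_dataset()
```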

Replace numba with a pure-Python alternative for the WASM version, or
gate the numba cells with a WASM check and change prose accordingly:

```
import sys

if "pyodide" not in sys.modules:
    import numba
```
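One way to apply that gate without duplicating code is a fallback decorator — a sketch under stated assumptions: the `maybe_njit` name and the pure-Python fallback are illustrative, not the repo's actual code:

```python
import sys


def maybe_njit(fn):
    """JIT-compile with numba when available; otherwise return fn unchanged.

    In Pyodide/WASM (where numba cannot run), or when numba is not
    installed, the pure-Python function is used as-is, so the notebook
    executes in both environments.
    """
    if "pyodide" in sys.modules:
        return fn
    try:
        from numba import njit
        return njit(fn)
    except ImportError:
        return fn


@maybe_njit
def triangular(n):
    # A toy numeric loop standing in for the lesson's hot loop.
    total = 0
    for i in range(n + 1):
        total += i
    return total
```

The surrounding prose would then only need to note that the speedup applies outside the browser.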

- Add Altair notebooks from <https://uwdata.github.io/visualization-curriculum/>.
- Add formative assessment widgets.
- Update Little's Law notebook.

This view is limited to 50 files because it contains too many changes.

Files changed (50):
  1. .github/workflows/check-empty-cells.yml +4 -1
  2. .github/workflows/deploy.yml +1 -1
  3. .gitignore +9 -0
  4. .typos.toml +3 -0
  5. Makefile +105 -13
  6. _server/README.md +0 -5
  7. _server/main.py +3 -3
  8. altair/01_introduction.py +671 -0
  9. altair/02_marks_encoding.py +1126 -0
  10. altair/03_data_transformation.py +641 -0
  11. altair/04_scales_axes_legends.py +840 -0
  12. altair/05_view_composition.py +818 -0
  13. altair/06_interaction.py +671 -0
  14. altair/07_cartographic.py +898 -0
  15. altair/08_debugging.py +370 -0
  16. altair/altair_introduction.py.lock +0 -0
  17. altair/index.md +14 -0
  18. assets/styles.css +51 -0
  19. bin/build.py +93 -0
  20. {scripts → bin}/check_empty_cells.py +2 -3
  21. bin/check_missing_titles.py +21 -0
  22. bin/check_notebook_packages.py +110 -0
  23. bin/create_sql_lab.sql +22 -0
  24. bin/create_sql_penguins.py +50 -0
  25. bin/create_sql_survey.py +175 -0
  26. bin/extract.py +47 -0
  27. {scripts → bin}/preview.py +1 -2
  28. bin/run_notebooks.sh +11 -0
  29. bin/utils.py +14 -0
  30. daft/README.md +0 -31
  31. daft/_index.md +13 -0
  32. data/penguins.csv +345 -0
  33. duckdb/01_getting_started.py +8 -11
  34. duckdb/{008_loading_parquet.py → 08_loading_parquet.py} +3 -3
  35. duckdb/{009_loading_json.py → 09_loading_json.py} +3 -3
  36. duckdb/{011_working_with_apache_arrow.py → 11_working_with_apache_arrow.py} +21 -28
  37. duckdb/DuckDB_Loading_CSVs.py +3 -4
  38. duckdb/README.md +0 -37
  39. duckdb/index.md +16 -0
  40. {functional_programming → functional}/05_functors.py +1 -1
  41. {functional_programming → functional}/06_applicatives.py +1 -1
  42. functional/_index.md +25 -0
  43. functional_programming/CHANGELOG.md +0 -129
  44. functional_programming/README.md +0 -77
  45. optimization/01_least_squares.py +3 -3
  46. optimization/02_linear_program.py +5 -5
  47. optimization/03_minimum_fuel_optimal_control.py +8 -4
  48. optimization/04_quadratic_program.py +5 -5
  49. optimization/05_portfolio_optimization.py +9 -9
  50. optimization/06_convex_optimization.py +3 -3
.github/workflows/check-empty-cells.yml CHANGED
@@ -17,6 +17,9 @@ jobs:
       - name: 🔄 Checkout code
         uses: actions/checkout@v4
 
+      - name: 🚀 Install uv
+        uses: astral-sh/setup-uv@v4
+
       - name: 🐍 Set up Python
         uses: actions/setup-python@v5
         with:
@@ -24,7 +27,7 @@ jobs:
 
       - name: 🔍 Check for empty cells
         run: |
-          python scripts/check_empty_cells.py
+          make check_empty
 
       - name: 📊 Report results
         if: failure()
.github/workflows/deploy.yml CHANGED
@@ -32,7 +32,7 @@ jobs:
 
       - name: 🛠️ Export notebooks
         run: |
-          python scripts/build.py
+          make build
 
       - name: 📤 Upload artifact
         uses: actions/upload-pages-artifact@v3
.gitignore CHANGED
@@ -175,3 +175,12 @@ __marimo__
 
 # Generated site content
 _site/
+
+# Editors
+*~
+
+# Temporary build files
+tmp/
+example.db
+example.db.wal
+log_data_filtered*.*
.typos.toml CHANGED
@@ -15,7 +15,10 @@ extend-ignore-re = [
 
 # Words to explicitly accept
 [default.extend-words]
+bimap = "bimap"
 pn = "pn"
+setp = "setp"
+Plas = "Plas"
 
 # You can also exclude specific files or directories if needed
 # [files]
Makefile CHANGED
@@ -1,24 +1,116 @@
-# Default target.
-all : commands
+ROOT := .
+SITE := _site
+TMP := ./tmp
+LESSON_DATA := ${TMP}/lessons.json
+TEMPLATES := $(wildcard templates/*.html)
+
+NOTEBOOK_INDEX := $(wildcard */index.md)
+NOTEBOOK_DIR := $(patsubst %/index.md,%,${NOTEBOOK_INDEX})
+NOTEBOOK_SRC := $(foreach dir,$(NOTEBOOK_DIR),$(wildcard $(dir)/??_*.py))
+NOTEBOOK_OUT := $(patsubst %.py,${SITE}/%.html,$(NOTEBOOK_SRC))
+
+DATABASES := \
+	sql/public/lab.db \
+	sql/public/penguins.db \
+	sql/public/survey.db
+
+MARIMO := uv run marimo
+PYTHON := uv run python
+
+# Default target
+all: commands
 
-## commands : show all commands.
-commands :
+## commands : show all commands
+commands:
 	@grep -h -E '^##' ${MAKEFILE_LIST} | sed -e 's/## //g' | column -t -s ':'
 
-## install: install minimal required packages into current environment.
+## install: install required packages
 install:
-	uv pip install marimo jinja2 markdown
+	uv pip install -r requirements.txt
+
+## check: run all simple checks
+check:
+	-@make check_empty
+	-@make check_titles
+	-@make check_typos
+	-@make check_packages
 
-## build: build entire site.
-build:
-	rm -rf _site
-	uv run scripts/build.py
+## check_exec: run notebooks to check for runtime errors
+check_exec:
+	@if [ -z "$(NOTEBOOKS)" ]; then \
+		bash bin/run_notebooks.sh $(NOTEBOOK_SRC); \
+	else \
+		bash bin/run_notebooks.sh $(NOTEBOOKS); \
+	fi
 
-## serve: run local web server without rebuilding.
+## build: build website
+build: ${LESSON_DATA} ${NOTEBOOK_OUT} ${TEMPLATES}
+	${PYTHON} bin/build.py --root ${ROOT} --output ${SITE} --data ${LESSON_DATA}
+
+## links: check links locally (while 'make serve')
+links:
+	linkchecker -F text http://localhost:8000
+
+## serve: run local web server without rebuilding
 serve:
-	uv run python -m http.server --directory _site
+	${PYTHON} -m http.server --directory ${SITE}
+
+## databases: rebuild datasets for SQL lessons
+databases: ${DATABASES}
 
-## clean: clean up stray files.
+## ---: ---
+
+## clean: clean up stray files
 clean:
 	@find . -name '*~' -exec rm {} +
 	@find . -name '.DS_Store' -exec rm {} +
+	@rm -rf ${TMP}
+	@rm -f log_data_filtered*.*
+
+## check_empty: check for empty cells
+check_empty:
+	@${PYTHON} bin/check_empty_cells.py
+
+## check_titles: check for missing titles in notebooks
+check_titles:
+	@${PYTHON} bin/check_missing_titles.py
+
+## check_packages: check for inconsistent package versions across notebooks
+check_packages:
+	@if [ -z "$(NOTEBOOKS)" ]; then \
+		${PYTHON} bin/check_notebook_packages.py $(NOTEBOOK_SRC); \
+	else \
+		${PYTHON} bin/check_notebook_packages.py $(NOTEBOOKS); \
+	fi
+
+## check_typos: check for typos
+check_typos:
+	@typos ${TEMPLATES} ${NOTEBOOK_INDEX} ${NOTEBOOK_SRC}
+
+## extract: extract lesson data
+extract: ${LESSON_DATA}
+
+#
+# subsidiary targets
+#
+
+tmp/lessons.json: $(NOTEBOOK_INDEX)
+	${PYTHON} bin/extract.py --root ${ROOT} --data ${LESSON_DATA}
+
+${SITE}/%.html: %.py
+	${MARIMO} export html-wasm --force --mode edit $< -o $@ --sandbox
+
+sql/public/lab.db: bin/create_sql_lab.sql
+	@rm -f $@
+	@mkdir -p sql/public
+	sqlite3 $@ < $<
+
+sql/public/penguins.db: bin/create_sql_penguins.py data/penguins.csv
+	@rm -f $@
+	@mkdir -p sql/public
+	${PYTHON} $< data/penguins.csv $@
+
+sql/public/survey.db: bin/create_sql_survey.py
+	@rm -f $@
+	@mkdir -p sql/public
+	${PYTHON} $< $@ 192837
_server/README.md CHANGED
@@ -1,8 +1,3 @@
----
-title: Readme
-marimo-version: 0.18.4
----
-
 # marimo learn server
 
 This folder contains server code for hosting marimo apps.
_server/main.py CHANGED
@@ -6,14 +6,14 @@
 #     "starlette",
 #     "python-dotenv",
 #     "pydantic",
-#     "duckdb==1.3.2",
-#     "altair==5.5.0",
+#     "duckdb==1.4.4",
+#     "altair==6.0.0",
 #     "beautifulsoup4==4.13.3",
 #     "httpx==0.28.1",
 #     "marimo",
 #     "nest-asyncio==1.6.0",
 #     "numba==0.61.0",
-#     "numpy==2.1.3",
+#     "numpy==2.4.3",
 #     "polars==1.24.0",
 # ]
 # ///
altair/01_introduction.py ADDED
@@ -0,0 +1,671 @@
1
+ # /// script
2
+ # requires-python = ">=3.11"
3
+ # dependencies = [
4
+ # "altair==6.0.0",
5
+ # "marimo",
6
+ # "pandas==3.0.1",
7
+ # "vega_datasets==0.9.0",
8
+ # ]
9
+ # ///
10
+
11
+ import marimo
12
+
13
+ __generated_with = "0.20.4"
14
+ app = marimo.App()
15
+
16
+
17
+ @app.cell
18
+ def _():
19
+ import marimo as mo
20
+
21
+ return (mo,)
22
+
23
+
24
+ @app.cell(hide_code=True)
25
+ def _(mo):
26
+ mo.md(r"""
27
+ # Introduction to Altair
28
+
29
+ [Altair](https://altair-viz.github.io/) is a declarative statistical visualization library for Python. Altair offers a powerful and concise visualization grammar for quickly building a wide range of statistical graphics.
30
+
31
+ By *declarative*, we mean that you can provide a high-level specification of *what* you want the visualization to include, in terms of *data*, *graphical marks*, and *encoding channels*, rather than having to specify *how* to implement the visualization in terms of for-loops, low-level drawing commands, *etc*. The key idea is that you declare links between data fields and visual encoding channels, such as the x-axis, y-axis, color, *etc*. The rest of the plot details are handled automatically. Building on this declarative plotting idea, a surprising range of simple to sophisticated visualizations can be created using a concise grammar.
32
+
33
+ Altair is based on [Vega-Lite](https://vega.github.io/vega-lite/), a high-level grammar of interactive graphics. Altair provides a friendly Python [API (Application Programming Interface)](https://en.wikipedia.org/wiki/Application_programming_interface) that generates Vega-Lite specifications in [JSON (JavaScript Object Notation)](https://en.wikipedia.org/wiki/JSON) format. Environments such as Jupyter Notebooks, JupyterLab, and Colab can then take this specification and render it directly in the web browser. To learn more about the motivation and basic concepts behind Altair and Vega-Lite, watch the [Vega-Lite presentation video from OpenVisConf 2017](https://www.youtube.com/watch?v=9uaHRWj04D4).
34
+
35
+ This notebook will guide you through the basic process of creating visualizations in Altair. First, you will need to make sure you have the Altair package and its dependencies installed (for more, see the [Altair installation documentation](https://altair-viz.github.io/getting_started/installation.html)), or you are using a notebook environment that includes the dependencies pre-installed.
36
+
37
+ _This notebook is part of the [data visualization curriculum](https://github.com/uwdata/visualization-curriculum)._
38
+ """)
39
+ return
40
+
41
+
42
+ @app.cell(hide_code=True)
43
+ def _(mo):
44
+ mo.md(r"""
45
+ ## Imports
46
+
47
+ To start, we must import the necessary libraries: Pandas for data frames and Altair for visualization.
48
+ """)
49
+ return
50
+
51
+
52
+ @app.cell
53
+ def _():
54
+ import pandas as pd
55
+ import altair as alt
56
+
57
+ return alt, pd
58
+
59
+
60
+ @app.cell(hide_code=True)
61
+ def _(mo):
62
+ mo.md(r"""
63
+ ## Renderers
64
+
65
+ Depending on your environment, you may need to specify a [renderer](https://altair-viz.github.io/user_guide/display_frontends.html) for Altair. If you are using __JupyterLab__, __Jupyter Notebook__, or __Google Colab__ with a live Internet connection you should not need to do anything. Otherwise, please read the documentation for [Displaying Altair Charts](https://altair-viz.github.io/user_guide/display_frontends.html).
66
+ """)
67
+ return
68
+
69
+
70
+ @app.cell(hide_code=True)
71
+ def _(mo):
72
+ mo.md(r"""
73
+ ## Data
74
+
75
+ Data in Altair is built around the Pandas data frame, which consists of a set of named data *columns*. We will also regularly refer to data columns as data *fields*.
76
+
77
+ When using Altair, datasets are commonly provided as data frames. Alternatively, Altair can also accept a URL to load a network-accessible dataset. As we will see, the named columns of the data frame are an essential piece of plotting with Altair.
78
+
79
+ We will often use datasets from the [vega-datasets](https://github.com/vega/vega-datasets) repository. Some of these datasets are directly available as Pandas data frames:
80
+ """)
81
+ return
82
+
83
+
84
+ @app.cell
85
+ def _():
86
+ from vega_datasets import data # import vega_datasets
87
+ cars = data.cars() # load cars data as a Pandas data frame
88
+ cars.head() # display the first five rows
89
+ return cars, data
90
+
91
+
92
+ @app.cell(hide_code=True)
93
+ def _(mo):
94
+ mo.md(r"""
95
+ Datasets in the vega-datasets collection can also be accessed via URLs:
96
+ """)
97
+ return
98
+
99
+
100
+ @app.cell
101
+ def _(data):
102
+ data.cars.url
103
+ return
104
+
105
+
106
+ @app.cell(hide_code=True)
107
+ def _(mo):
108
+ mo.md(r"""
109
+ Dataset URLs can be passed directly to Altair (for supported formats like JSON and [CSV](https://en.wikipedia.org/wiki/Comma-separated_values)), or loaded into a Pandas data frame like so:
110
+ """)
111
+ return
112
+
113
+
114
+ @app.cell
115
+ def _(data, pd):
116
+ pd.read_json(data.cars.url).head() # load JSON data into a data frame
117
+ return
118
+
119
+
120
+ @app.cell(hide_code=True)
121
+ def _(mo):
122
+ mo.md(r"""
123
+ For more information about data frames - and some useful transformations to prepare Pandas data frames for plotting with Altair! - see the [Specifying Data with Altair documentation](https://altair-viz.github.io/user_guide/data.html).
124
+ """)
125
+ return
126
+
127
+
128
+ @app.cell(hide_code=True)
129
+ def _(mo):
130
+ mo.md(r"""
131
+ ### Weather Data
132
+
133
+ Statistical visualization in Altair begins with ["tidy"](http://vita.had.co.nz/papers/tidy-data.html) data frames. Here, we'll start by creating a simple data frame (`df`) containing the average precipitation (`precip`) for a given `city` and `month` :
134
+ """)
135
+ return
136
+
137
+
138
+ @app.cell
139
+ def _(pd):
140
+ df = pd.DataFrame({
141
+ 'city': ['Seattle', 'Seattle', 'Seattle', 'New York', 'New York', 'New York', 'Chicago', 'Chicago', 'Chicago'],
142
+ 'month': ['Apr', 'Aug', 'Dec', 'Apr', 'Aug', 'Dec', 'Apr', 'Aug', 'Dec'],
143
+ 'precip': [2.68, 0.87, 5.31, 3.94, 4.13, 3.58, 3.62, 3.98, 2.56]
144
+ })
145
+
146
+ df
147
+ return (df,)
148
+
149
+
150
+ @app.cell(hide_code=True)
151
+ def _(mo):
152
+ mo.md(r"""
153
+ ## The Chart Object
154
+
155
+ The fundamental object in Altair is the `Chart`, which takes a data frame as a single argument:
156
+ """)
157
+ return
158
+
159
+
160
+ @app.cell
161
+ def _(alt, df):
162
+ _chart = alt.Chart(df)
163
+ return
164
+
165
+
166
+ @app.cell(hide_code=True)
167
+ def _(mo):
168
+ mo.md(r"""
169
+ So far, we have defined the `Chart` object and passed it the simple data frame we generated above. We have not yet told the chart to *do* anything with the data.
170
+ """)
171
+ return
172
+
173
+
174
+ @app.cell(hide_code=True)
175
+ def _(mo):
176
+ mo.md(r"""
177
+ ## Marks and Encodings
178
+
179
+ With a chart object in hand, we can now specify how we would like the data to be visualized. We first indicate what kind of graphical *mark* (geometric shape) we want to use to represent the data. We can set the `mark` attribute of the chart object using the the `Chart.mark_*` methods.
180
+
181
+ For example, we can show the data as a point using `Chart.mark_point()`:
182
+ """)
183
+ return
184
+
185
+
186
+ @app.cell
187
+ def _(alt, df):
188
+ alt.Chart(df).mark_point()
189
+ return
190
+
191
+
192
+ @app.cell(hide_code=True)
193
+ def _(mo):
194
+ mo.md(r"""
195
+ Here the rendering consists of one point per row in the dataset, all plotted on top of each other, since we have not yet specified positions for these points.
196
+
197
+ To visually separate the points, we can map various *encoding channels*, or *channels* for short, to fields in the dataset. For example, we could *encode* the field `city` of the data using the `y` channel, which represents the y-axis position of the points. To specify this, use the `encode` method:
198
+ """)
199
+ return
200
+
201
+
202
+ @app.cell
203
+ def _(alt, df):
204
+ alt.Chart(df).mark_point().encode(
205
+ y='city',
206
+ )
207
+ return
208
+
209
+
210
+ @app.cell(hide_code=True)
211
+ def _(mo):
212
+ mo.md(r"""
213
+ The `encode()` method builds a key-value mapping between encoding channels (such as `x`, `y`, `color`, `shape`, `size`, *etc.*) to fields in the dataset, accessed by field name. For Pandas data frames, Altair automatically determines an appropriate data type for the mapped column, which in this case is the *nominal* type, indicating unordered, categorical values.
214
+
215
+ Though we've now separated the data by one attribute, we still have multiple points overlapping within each category. Let's further separate these by adding an `x` encoding channel, mapped to the `'precip'` field:
216
+ """)
217
+ return
218
+
219
+
220
+ @app.cell
221
+ def _(alt, df):
222
+ alt.Chart(df).mark_point().encode(
223
+ x='precip',
224
+ y='city'
225
+ )
226
+ return
227
+
228
+
229
+ @app.cell(hide_code=True)
230
+ def _(mo):
231
+ mo.md(r"""
232
+ _Seattle exhibits both the least-rainiest and most-rainiest months!_
233
+
234
+ The data type of the `'precip'` field is again automatically inferred by Altair, and this time is treated as a *quantitative* type (that is, a real-valued number). We see that grid lines and appropriate axis titles are automatically added as well.
235
+
236
+ Above we have specified key-value pairs using keyword arguments (`x='precip'`). In addition, Altair provides construction methods for encoding definitions, using the syntax `alt.X('precip')`. This alternative is useful for providing more parameters to an encoding, as we will see later in this notebook.
237
+ """)
238
+ return
239
+
240
+
241
+ @app.cell
242
+ def _(alt, df):
243
+ alt.Chart(df).mark_point().encode(
244
+ alt.X('precip'),
245
+ alt.Y('city')
246
+ )
247
+ return
248
+
249
+
250
+ @app.cell(hide_code=True)
251
+ def _(mo):
252
+ mo.md(r"""
253
+ The two styles of specifying encodings can be interleaved: `x='precip', alt.Y('city')` is also a valid input to the `encode` function.
254
+
255
+ In the examples above, the data type for each field is inferred automatically based on its type within the Pandas data frame. We can also explicitly indicate the data type to Altair by annotating the field name:
256
+
257
+ - `'b:N'` indicates a *nominal* type (unordered, categorical data),
258
+ - `'b:O'` indicates an *ordinal* type (rank-ordered data),
259
+ - `'b:Q'` indicates a *quantitative* type (numerical data with meaningful magnitudes), and
260
+ - `'b:T'` indicates a *temporal* type (date/time data)
261
+
262
+ For example, `alt.X('precip:N')`.
263
+
264
+ Explicit annotation of data types is necessary when data is loaded from an external URL directly by Vega-Lite (skipping Pandas entirely), or when we wish to use a type that differs from the type that was automatically inferred.
265
+
266
+ What do you think will happen to our chart above if we treat `precip` as a nominal or ordinal variable, rather than a quantitative variable? _Modify the code above and find out!_
267
+
268
+ We will take a closer look at data types and encoding channels in the next notebook of the [data visualization curriculum](https://github.com/uwdata/visualization-curriculum#data-visualization-curriculum).
269
+ """)
270
+ return
271
+
272
+
273
+ @app.cell(hide_code=True)
274
+ def _(mo):
275
+ mo.md(r"""
276
+ ## Data Transformation: Aggregation
277
+
278
+ To allow for more flexibility in how data are visualized, Altair has a built-in syntax for *aggregation* of data. For example, we can compute the average of all values by specifying an aggregation function along with the field name:
279
+ """)
280
+ return
281
+
282
+
283
+ @app.cell
284
+ def _(alt, df):
285
+ alt.Chart(df).mark_point().encode(
286
+ x='average(precip)',
287
+ y='city'
288
+ )
289
+ return
290
+
291
+
292
+ @app.cell(hide_code=True)
293
+ def _(mo):
294
+ mo.md(r"""
295
+ Now within each x-axis category, we see a single point reflecting the *average* of the values within that category.
296
+
297
+ _Does Seattle really have the lowest average precipitation of these cities? (It does!) Still, how might this plot mislead? Which months are included? What counts as precipitation?_
298
+
299
+ Altair supports a variety of aggregation functions, including `count`, `min` (minimum), `max` (maximum), `average`, `median`, and `stdev` (standard deviation). In a later notebook, we will take a tour of data transformations, including aggregation, sorting, filtering, and creation of new derived fields using calculation formulas.
300
+ """)
301
+ return
302
+
303
+
304
+ @app.cell(hide_code=True)
305
+ def _(mo):
306
+ mo.md(r"""
307
+ ## Changing the Mark Type
308
+
309
+ Let's say we want to represent our aggregated values using rectangular bars rather than circular points. We can do this by replacing `Chart.mark_point` with `Chart.mark_bar`:
310
+ """)
311
+ return
312
+
313
+
314
+ @app.cell
315
+ def _(alt, df):
316
+ alt.Chart(df).mark_bar().encode(
317
+ x='average(precip)',
318
+ y='city'
319
+ )
320
+ return
321
+
322
+
323
+ @app.cell(hide_code=True)
324
+ def _(mo):
325
+ mo.md(r"""
326
+ Because the nominal field `a` is mapped to the `y`-axis, the result is a horizontal bar chart. To get a vertical bar chart, we can simply swap the `x` and `y` keywords:
327
+ """)
328
+ return
329
+
330
+
331
+ @app.cell
332
+ def _(alt, df):
333
+ alt.Chart(df).mark_bar().encode(
334
+ x='city',
335
+ y='average(precip)'
336
+ )
337
+ return
338
+
339
+
340
+ @app.cell(hide_code=True)
341
+ def _(mo):
342
+ mo.md(r"""
343
+ ## Customizing a Visualization
344
+
345
+ By default Altair / Vega-Lite make some choices about properties of the visualization, but these can be changed using methods to customize the look of the visualization. For example, we can specify the axis titles using the `axis` attribute of channel classes, we can modify scale properties using the `scale` attribute, and we can specify the color of the marking by setting the `color` keyword of the `Chart.mark_*` methods to any valid [CSS color string](https://developer.mozilla.org/en-US/docs/Web/CSS/color_value):
346
+ """)
347
+ return
348
+
349
+
350
+ @app.cell
351
+ def _(alt, df):
352
+ alt.Chart(df).mark_point(color='firebrick').encode(
353
+ alt.X('precip', scale=alt.Scale(type='log'), axis=alt.Axis(title='Log-Scaled Values')),
354
+ alt.Y('city', axis=alt.Axis(title='Category')),
355
+ )
356
+ return
357
+
358
+
359
+ @app.cell(hide_code=True)
360
+ def _(mo):
361
+ mo.md(r"""
362
+ A subsequent module will explore the various options available for scales, axes, and legends to create customized charts.
363
+ """)
364
+ return
365
+
366
+
367
+ @app.cell(hide_code=True)
368
+ def _(mo):
369
+ mo.md(r"""
370
+ ## Multiple Views
371
+
372
+ As we've seen above, the Altair `Chart` object represents a plot with a single mark type. What about more complicated diagrams, involving multiple charts or layers? Using a set of *view composition* operators, Altair can take multiple chart definitions and combine them to create more complex views.
373
+
374
+ As a starting point, let's plot the cars dataset in a line chart showing the average mileage by the year of manufacture:
375
+ """)
376
+ return
377
+
378
+
379
+ @app.cell
380
+ def _(alt, cars):
381
+ alt.Chart(cars).mark_line().encode(
382
+ alt.X('Year'),
383
+ alt.Y('average(Miles_per_Gallon)')
384
+ )
385
+ return
386
+
387
+
388
+ @app.cell(hide_code=True)
389
+ def _(mo):
390
+ mo.md(r"""
391
+ To augment this plot, we might like to add `circle` marks for each averaged data point. (The `circle` mark is just a convenient shorthand for `point` marks that used filled circles.)
392
+
393
+ We can start by defining each chart separately: first a line plot, then a scatter plot. We can then use the `layer` operator to combine the two into a layered chart. Here we use the shorthand `+` (plus) operator to invoke layering:
394
+ """)
395
+ return
396
+
397
+
398
+ @app.cell
399
+ def _(alt, cars):
400
+ line = alt.Chart(cars).mark_line().encode(
401
+ alt.X('Year'),
402
+ alt.Y('average(Miles_per_Gallon)')
403
+ )
404
+
405
+ point = alt.Chart(cars).mark_circle().encode(
406
+ alt.X('Year'),
407
+ alt.Y('average(Miles_per_Gallon)')
408
+ )
409
+
410
+ line + point
411
+ return
412
+
413
+
414
+ @app.cell(hide_code=True)
415
+ def _(mo):
416
+ mo.md(r"""
417
+ We can also create this chart by *reusing* and *modifying* a previous chart definition! Rather than completely re-write a chart, we can start with the line chart, then invoke the `mark_point` method to generate a new chart definition with a different mark type:
418
+ """)
419
+ return
420
+
421
+
422
+ @app.cell
423
+ def _(alt, cars):
424
+ mpg = alt.Chart(cars).mark_line().encode(
425
+ alt.X('Year'),
426
+ alt.Y('average(Miles_per_Gallon)')
427
+ )
428
+
429
+ mpg + mpg.mark_circle()
430
+ return (mpg,)
431
+
432
+
433
+ @app.cell(hide_code=True)
434
+ def _(mo):
435
+ mo.md(r"""
436
+ <em>(The need to place points on lines is so common, the `line` mark also includes a shorthand to generate a new layer for you. Trying adding the argument `point=True` to the `mark_line` method!)</em>
437
+
438
+ Now, what if we'd like to see this chart alongside other plots, such as the average horsepower over time?
439
+
440
+ We can use *concatenation* operators to place multiple charts side-by-side, either vertically or horizontally. Here, we'll use the `|` (pipe) operator to perform horizontal concatenation of two charts:
441
+ """)
442
+ return
443
+
444
+
445
+ @app.cell
446
+ def _(alt, cars, mpg):
447
+ hp = alt.Chart(cars).mark_line().encode(
448
+ alt.X('Year'),
449
+ alt.Y('average(Horsepower)')
450
+ )
451
+
452
+ (mpg + mpg.mark_circle()) | (hp + hp.mark_circle())
453
+ return
454
+
455
+
456
+ @app.cell(hide_code=True)
457
+ def _(mo):
458
+ mo.md(r"""
459
+ _We can see that, in this dataset, over the 1970s and early '80s the average fuel efficiency improved while the average horsepower decreased._
460
+
461
+ A later notebook will focus on *view composition*, including not only layering and concatenation, but also the `facet` operator for splitting data into sub-plots and the `repeat` operator to concisely generate concatenated charts from a template.
462
+ """)
463
+ return
464
+
465
+
466
+ @app.cell(hide_code=True)
467
+ def _(mo):
468
+ mo.md(r"""
469
+ ## Interactivity
470
+
471
+ In addition to basic plotting and view composition, one of Altair and Vega-Lite's most exciting features is their support for interaction.
472
+
473
+ To create a simple interactive plot that supports panning and zooming, we can invoke the `interactive()` method of the `Chart` object. In the chart below, click and drag to *pan* or use the scroll wheel to *zoom*:
474
+ """)
475
+ return
476
+
477
+
478
+ @app.cell
479
+ def _(alt, cars):
480
+ alt.Chart(cars).mark_point().encode(
481
+ x='Horsepower',
482
+ y='Miles_per_Gallon',
483
+ color='Origin',
484
+ ).interactive()
485
+ return
486
+
487
+
488
+ @app.cell(hide_code=True)
489
+ def _(mo):
490
+ mo.md(r"""
491
+ To provide more details upon mouse hover, we can use the `tooltip` encoding channel:
492
+ """)
493
+ return
494
+
495
+
496
+ @app.cell
497
+ def _(alt, cars):
498
+ alt.Chart(cars).mark_point().encode(
499
+ x='Horsepower',
500
+ y='Miles_per_Gallon',
501
+ color='Origin',
502
+ tooltip=['Name', 'Origin'] # show Name and Origin in a tooltip
503
+ ).interactive()
504
+ return
505
+
506
+
507
+ @app.cell(hide_code=True)
508
+ def _(mo):
509
+ mo.md(r"""
510
+ For more complex interactions, such as linked charts and cross-filtering, Altair provides a *selection* abstraction for defining interactive selections and then binding them to components of a chart. We will cover this in detail in a later notebook.
511
+
512
+ Below is a more complex example. The upper histogram shows the count of cars per year and uses an interactive selection to modify the opacity of points in the lower scatter plot, which shows horsepower versus mileage.
513
+
514
+ _Drag out an interval in the upper chart and see how it affects the points in the lower chart. As you examine the code, **don't worry if parts don't make sense yet!** This is an aspirational example, and we will fill in all the needed details over the course of the different notebooks._
515
+ """)
516
+ return
517
+
518
+
519
+ @app.cell
520
+ def _(alt, cars):
521
+ # create an interval selection over an x-axis encoding
522
+ brush = alt.selection_interval(encodings=['x'])
523
+
524
+ # determine opacity based on brush
525
+ opacity = alt.condition(brush, alt.value(0.9), alt.value(0.1))
526
+
527
+ # an overview histogram of cars per year
528
+ # add the interval brush to select cars over time
529
+ overview = alt.Chart(cars).mark_bar().encode(
530
+ alt.X('Year:O', timeUnit='year', # extract year unit, treat as ordinal
531
+ axis=alt.Axis(title=None, labelAngle=0) # no title, horizontal labels
532
+ ),
533
+ alt.Y('count()', title=None), # counts, no axis title
534
+ opacity=opacity
535
+ ).add_params(
536
+ brush # add interval brush selection to the chart
537
+ ).properties(
538
+ width=400, # set the chart width to 400 pixels
539
+ height=50 # set the chart height to 50 pixels
540
+ )
541
+
542
+ # a detail scatterplot of horsepower vs. mileage
543
+ # modulate point opacity based on the brush selection
544
+ detail = alt.Chart(cars).mark_point().encode(
545
+ alt.X('Horsepower'),
546
+ alt.Y('Miles_per_Gallon'),
547
+ # set opacity based on brush selection
548
+ opacity=opacity
549
+ ).properties(width=400) # set chart width to match the first chart
550
+
551
+ # vertically concatenate (vconcat) charts using the '&' operator
552
+ overview & detail
553
+ return
554
+
555
+
556
+ @app.cell(hide_code=True)
557
+ def _(mo):
558
+ mo.md(r"""
559
+ ## Aside: Examining the JSON Output
560
+
561
+ As a Python API to Vega-Lite, Altair's main purpose is to convert plot specifications to a JSON string that conforms to the Vega-Lite schema. Using the `Chart.to_json` method, we can inspect the JSON specification that Altair is exporting and sending to Vega-Lite:
562
+ """)
563
+ return
564
+
565
+
566
+ @app.cell
567
+ def _(alt, df):
568
+ _chart = alt.Chart(df).mark_bar().encode(x='average(precip)', y='city')
569
+ print(_chart.to_json())
570
+ return
571
+
572
+
573
+ @app.cell(hide_code=True)
574
+ def _(mo):
575
+ mo.md(r"""
576
+ Notice here that `encode(x='average(precip)')` has been expanded to a JSON structure with a `field` name, a `type` for the data, and an `aggregate` operation. The `encode(y='city')` statement has been expanded similarly.
577
+
578
+ As we saw earlier, Altair's shorthand syntax includes a way to specify the type of the field as well:
579
+ """)
580
+ return
581
+
582
+
583
+ @app.cell
584
+ def _(alt):
585
+ _x = alt.X('average(precip):Q')
586
+ print(_x.to_json())
587
+ return
588
+
589
+
590
+ @app.cell(hide_code=True)
591
+ def _(mo):
592
+ mo.md(r"""
593
+ This shorthand is equivalent to spelling out the attributes by name:
594
+ """)
595
+ return
596
+
597
+
598
+ @app.cell
599
+ def _(alt):
600
+ _x = alt.X(aggregate='average', field='precip', type='quantitative')
601
+ print(_x.to_json())
602
+ return
603
+
604
+
605
+ @app.cell(hide_code=True)
606
+ def _(mo):
607
+ mo.md(r"""
608
+ ## Publishing a Visualization
609
+
610
+ Once you have visualized your data, perhaps you would like to publish it somewhere on the web. This can be done straightforwardly using the [vega-embed JavaScript package](https://github.com/vega/vega-embed). A simple example of a stand-alone HTML document can be generated for any chart using the `Chart.save` method:
611
+
612
+ ```python
613
+ chart = alt.Chart(df).mark_bar().encode(
614
+ x='average(precip)',
615
+ y='city',
616
+ )
617
+ chart.save('chart.html')
618
+ ```
619
+
620
+
621
+ The basic HTML template produces output that looks like this, where the JSON specification for your plot produced by `Chart.to_json` should be stored in the `spec` JavaScript variable:
622
+
623
+ ```html
624
+ <!DOCTYPE html>
625
+ <html>
626
+ <head>
627
+ <script src="https://cdn.jsdelivr.net/npm/vega@5"></script>
628
+ <script src="https://cdn.jsdelivr.net/npm/vega-lite@4"></script>
629
+ <script src="https://cdn.jsdelivr.net/npm/vega-embed@6"></script>
630
+ </head>
631
+ <body>
632
+ <div id="vis"></div>
633
+ <script>
634
+ (function(vegaEmbed) {
635
+ var spec = {}; /* JSON output for your chart's specification */
636
+ var embedOpt = {"mode": "vega-lite"}; /* Options for the embedding */
637
+
638
+ function showError(el, error){
639
+ el.innerHTML = ('<div style="color:red;">'
640
+ + '<p>JavaScript Error: ' + error.message + '</p>'
641
+ + "<p>This usually means there's a typo in your chart specification. "
642
+ + "See the javascript console for the full traceback.</p>"
643
+ + '</div>');
644
+ throw error;
645
+ }
646
+ const el = document.getElementById('vis');
647
+ vegaEmbed("#vis", spec, embedOpt)
648
+ .catch(error => showError(el, error));
649
+ })(vegaEmbed);
650
+ </script>
651
+ </body>
652
+ </html>
653
+ ```
654
+
655
+ The `Chart.save` method provides a convenient way to save such HTML output to a file. For more information on embedding Altair/Vega-Lite, see the [documentation of the vega-embed project](https://github.com/vega/vega-embed).
656
+ """)
657
+ return
658
+
659
+
660
+ @app.cell(hide_code=True)
661
+ def _(mo):
662
+ mo.md(r"""
663
+ ## Next Steps
664
+
665
+ 🎉 Hooray, you've completed the introduction to Altair! In the next notebook, we will dive deeper into creating visualizations using Altair's model of data types, graphical marks, and visual encoding channels.
666
+ """)
667
+ return
668
+
669
+
670
+ if __name__ == "__main__":
671
+ app.run()
altair/02_marks_encoding.py ADDED
@@ -0,0 +1,1126 @@
1
+ # /// script
2
+ # requires-python = ">=3.11"
3
+ # dependencies = [
4
+ # "altair==6.0.0",
5
+ # "marimo",
6
+ # "pandas==3.0.1",
7
+ # "vega_datasets==0.9.0",
8
+ # ]
9
+ # ///
10
+
11
+ import marimo
12
+
13
+ __generated_with = "0.20.4"
14
+ app = marimo.App()
15
+
16
+
17
+ @app.cell
18
+ def _():
19
+ import marimo as mo
20
+
21
+ return (mo,)
22
+
23
+
24
+ @app.cell(hide_code=True)
25
+ def _(mo):
26
+ mo.md(r"""
27
+ # Data Types, Graphical Marks, and Visual Encoding Channels
28
+
29
+ A visualization represents data using a collection of _graphical marks_ (bars, lines, points, etc.). The attributes of a mark &mdash; such as its position, shape, size, or color &mdash; serve as _channels_ through which we can encode underlying data values.
30
+ """)
31
+ return
32
+
33
+
34
+ @app.cell(hide_code=True)
35
+ def _(mo):
36
+ mo.md(r"""
37
+ With a basic framework of _data types_, _marks_, and _encoding channels_, we can concisely create a wide variety of visualizations. In this notebook, we explore each of these elements and show how to use them to create custom statistical graphics.
38
+
39
+ _This notebook is part of the [data visualization curriculum](https://github.com/uwdata/visualization-curriculum)._
40
+ """)
41
+ return
42
+
43
+
44
+ @app.cell
45
+ def _():
46
+ import pandas as pd
47
+ import altair as alt
48
+
49
+ return (alt,)
50
+
51
+
52
+ @app.cell(hide_code=True)
53
+ def _(mo):
54
+ mo.md(r"""
55
+ ## Global Development Data
56
+ """)
57
+ return
58
+
59
+
60
+ @app.cell(hide_code=True)
61
+ def _(mo):
62
+ mo.md(r"""
63
+ We will be visualizing global health and population data for a number of countries, over the time period of 1955 to 2005. The data was collected by the [Gapminder Foundation](https://www.gapminder.org/) and shared in [Hans Rosling's popular TED talk](https://www.youtube.com/watch?v=hVimVzgtD6w). If you haven't seen the talk, we encourage you to watch it first!
64
+
65
+ Let's first load the dataset from the [vega-datasets](https://github.com/vega/vega-datasets) collection into a Pandas data frame.
66
+ """)
67
+ return
68
+
69
+
70
+ @app.cell
71
+ def _():
72
+ from vega_datasets import data as vega_data
73
+ data = vega_data.gapminder()
74
+ return (data,)
75
+
76
+
77
+ @app.cell(hide_code=True)
78
+ def _(mo):
79
+ mo.md(r"""
80
+ How big is the data?
81
+ """)
82
+ return
83
+
84
+
85
+ @app.cell
86
+ def _(data):
87
+ data.shape
88
+ return
89
+
90
+
91
+ @app.cell(hide_code=True)
92
+ def _(mo):
93
+ mo.md(r"""
94
+ 693 rows and 6 columns! Let's take a peek at the data content:
95
+ """)
96
+ return
97
+
98
+
99
+ @app.cell
100
+ def _(data):
101
+ data.head(5)
102
+ return
103
+
104
+
105
+ @app.cell(hide_code=True)
106
+ def _(mo):
107
+ mo.md(r"""
108
+ For each `country` and `year` (in 5-year intervals), we have measures of fertility in terms of the number of children per woman (`fertility`), life expectancy in years (`life_expect`), and total population (`pop`).
109
+
110
+ We also see a `cluster` field with an integer code. What might this represent? We'll try and solve this mystery as we visualize the data!
111
+ """)
112
+ return
113
+
114
+
115
+ @app.cell(hide_code=True)
116
+ def _(mo):
117
+ mo.md(r"""
118
+ Let's also create a smaller data frame, filtered down to values for the year 2000 only:
119
+ """)
120
+ return
121
+
122
+
123
+ @app.cell
124
+ def _(data):
125
+ data2000 = data.loc[data['year'] == 2000]
126
+ return (data2000,)
127
+
128
+
129
+ @app.cell
130
+ def _(data2000):
131
+ data2000.head(5)
132
+ return
133
+
134
+
135
+ @app.cell(hide_code=True)
136
+ def _(mo):
137
+ mo.md(r"""
138
+ ## Data Types
139
+ """)
140
+ return
141
+
142
+
143
+ @app.cell(hide_code=True)
144
+ def _(mo):
145
+ mo.md(r"""
146
+ The first ingredient in effective visualization is the input data. Data values can represent different forms of measurement. What kinds of comparisons do those measurements support? And what kinds of visual encodings then support those comparisons?
147
+
148
+ We will start by looking at the basic data types that Altair uses to inform visual encoding choices. These data types determine the kinds of comparisons we can make, and thereby guide our visualization design decisions.
149
+
150
+ ### Nominal (N)
151
+
152
+ *Nominal* data (also called *categorical* data) consist of category names.
153
+
154
+ With nominal data we can compare the equality of values: *is value A the same or different than value B? (A = B)*, supporting statements like “A is equal to B” or “A is not equal to B”.
155
+ In the dataset above, the `country` field is nominal.
156
+
157
+ When visualizing nominal data we should readily be able to see if values are the same or different: position, color hue (blue, red, green, *etc.*), and shape can help. However, using a size channel to encode nominal data might mislead us, suggesting rank-order or magnitude differences among values that do not exist!
158
+
159
+ ### Ordinal (O)
160
+
161
+ *Ordinal* data consist of values that have a specific ordering.
162
+
163
+ With ordinal data we can compare the rank-ordering of values: *does value A come before or after value B? (A < B)*, supporting statements like “A is less than B” or “A is greater than B”.
164
+ In the dataset above, we can treat the `year` field as ordinal.
165
+
166
+ When visualizing ordinal data, we should perceive a sense of rank-order. Position, size, or color value (brightness) might be appropriate, whereas color hue (which is not perceptually ordered) would be less appropriate.
167
+
168
+ ### Quantitative (Q)
169
+
170
+ With *quantitative* data we can measure numerical differences among values. There are multiple sub-types of quantitative data:
171
+
172
+ For *interval* data we can measure the distance (interval) between points: *what is the distance to value A from value B? (A - B)*, supporting statements such as “A is 12 units away from B”.
173
+
174
+ For *ratio* data the zero-point is meaningful and so we can also measure proportions or scale factors: *value A is what proportion of value B? (A / B)*, supporting statements such as “A is 10% of B” or “B is 7 times larger than A”.
175
+
176
+ In the dataset above, `year` is a quantitative interval field (the value of year "zero" is subjective), whereas `fertility` and `life_expect` are quantitative ratio fields (zero is meaningful for calculating proportions).
177
+ Vega-Lite represents quantitative data, but does not make a distinction between interval and ratio types.
178
+
179
+ Quantitative values can be visualized using position, size, or color value, among other channels. An axis with a zero baseline is essential for proportional comparisons of ratio values, but can be safely omitted for interval comparisons.
180
+
181
+ ### Temporal (T)
182
+
183
+ *Temporal* values measure time points or intervals. This type is a special case of quantitative values (timestamps) with rich semantics and conventions (i.e., the [Gregorian calendar](https://en.wikipedia.org/wiki/Gregorian_calendar)). The temporal type in Vega-Lite supports reasoning about time units (year, month, day, hour, etc.), and provides methods for requesting specific time intervals.
184
+
185
+ Example temporal values include date strings such as `“2019-01-04”` and `“Jan 04 2019”`, as well as standardized date-times such as the [ISO date-time format](https://en.wikipedia.org/wiki/ISO_8601): `“2019-01-04T17:50:35.643Z”`.
186
+
187
+ There are no temporal values in our global development dataset above, as the `year` field is simply encoded as an integer. For more details about using temporal data in Altair, see the [Times and Dates documentation](https://altair-viz.github.io/user_guide/times_and_dates.html).
188
+
189
+ ### Summary
190
+
191
+ These data types are not mutually exclusive, but rather form a hierarchy: ordinal data support nominal (equality) comparisons, while quantitative data support ordinal (rank-order) comparisons.
192
+
193
+ Moreover, these data types do _not_ provide a fixed categorization. Just because a data field is represented using a number doesn't mean we have to treat it as a quantitative type! For example, we might interpret a set of ages (10 years old, 20 years old, etc) as nominal (underage or overage), ordinal (grouped by year), or quantitative (calculate average age).
194
+
195
+ Now let's examine how to visually encode these data types!
196
+ """)
197
+ return
198
+
199
+
200
+ @app.cell(hide_code=True)
201
+ def _(mo):
202
+ mo.md(r"""
203
+ ## Encoding Channels
204
+
205
+ At the heart of Altair is the use of *encodings* that bind data fields (with a given data type) to available encoding *channels* of a chosen *mark* type. In this notebook we'll examine the following encoding channels:
206
+
207
+ - `x`: Horizontal (x-axis) position of the mark.
208
+ - `y`: Vertical (y-axis) position of the mark.
209
+ - `size`: Size of the mark. May correspond to area or length, depending on the mark type.
210
+ - `color`: Mark color, specified as a [legal CSS color](https://developer.mozilla.org/en-US/docs/Web/CSS/color_value).
211
+ - `opacity`: Mark opacity, ranging from 0 (fully transparent) to 1 (fully opaque).
212
+ - `shape`: Plotting symbol shape for `point` marks.
213
+ - `tooltip`: Tooltip text to display upon mouse hover over the mark.
214
+ - `order`: Mark ordering, determines line/area point order and drawing order.
215
+ - `column`: Facet the data into horizontally-aligned subplots.
216
+ - `row`: Facet the data into vertically-aligned subplots.
217
+
218
+ For a complete list of available channels, see the [Altair encoding documentation](https://altair-viz.github.io/user_guide/encodings/index.html).
219
+ """)
220
+ return
221
+
222
+
223
+ @app.cell(hide_code=True)
224
+ def _(mo):
225
+ mo.md(r"""
226
+ ### X
227
+
228
+ The `x` encoding channel sets a mark's horizontal position (x-coordinate). In addition, default choices of axis and title are made automatically. In the chart below, the choice of a quantitative data type results in a continuous linear axis scale:
229
+ """)
230
+ return
231
+
232
+
233
+ @app.cell
234
+ def _(alt, data2000):
235
+ alt.Chart(data2000).mark_point().encode(
236
+ alt.X('fertility:Q')
237
+ )
238
+ return
239
+
240
+
241
+ @app.cell(hide_code=True)
242
+ def _(mo):
243
+ mo.md(r"""
244
+ ### Y
245
+
246
+ The `y` encoding channel sets a mark's vertical position (y-coordinate). Here we've added the `cluster` field using an ordinal (`O`) data type. The result is a discrete axis that includes a sized band, with a default step size, for each unique value:
247
+ """)
248
+ return
249
+
250
+
251
+ @app.cell
252
+ def _(alt, data2000):
253
+ alt.Chart(data2000).mark_point().encode(
254
+ alt.X('fertility:Q'),
255
+ alt.Y('cluster:O')
256
+ )
257
+ return
258
+
259
+
260
+ @app.cell(hide_code=True)
261
+ def _(mo):
262
+ mo.md(r"""
263
+ _What happens to the chart above if you swap the `O` and `Q` field types?_
264
+
265
+ If we instead add the `life_expect` field as a quantitative (`Q`) variable, the result is a scatter plot with linear scales for both axes:
266
+ """)
267
+ return
268
+
269
+
270
+ @app.cell
271
+ def _(alt, data2000):
272
+ alt.Chart(data2000).mark_point().encode(
273
+ alt.X('fertility:Q'),
274
+ alt.Y('life_expect:Q')
275
+ )
276
+ return
277
+
278
+
279
+ @app.cell(hide_code=True)
280
+ def _(mo):
281
+ mo.md(r"""
282
+ By default, axes for linear quantitative scales include zero to ensure a proper baseline for comparing ratio-valued data. In some cases, however, a zero baseline may be meaningless or you may want to focus on interval comparisons. To disable automatic inclusion of zero, configure the scale mapping using the encoding `scale` attribute:
283
+ """)
284
+ return
285
+
286
+
287
+ @app.cell
288
+ def _(alt, data2000):
289
+ alt.Chart(data2000).mark_point().encode(
290
+ alt.X('fertility:Q', scale=alt.Scale(zero=False)),
291
+ alt.Y('life_expect:Q', scale=alt.Scale(zero=False))
292
+ )
293
+ return
294
+
295
+
296
+ @app.cell(hide_code=True)
297
+ def _(mo):
298
+ mo.md(r"""
299
+ Now the axis scales no longer include zero by default. Some padding still remains, as the axis domain end points are automatically snapped to _nice_ numbers like multiples of 5 or 10.
300
+
301
+ _What happens if you also add `nice=False` to the scale attribute above?_
302
+ """)
303
+ return
304
+
305
+
306
+ @app.cell(hide_code=True)
307
+ def _(mo):
308
+ mo.md(r"""
309
+ ### Size
310
+ """)
311
+ return
312
+
313
+
314
+ @app.cell(hide_code=True)
315
+ def _(mo):
316
+ mo.md(r"""
317
+ The `size` encoding channel sets a mark's size or extent. The meaning of the channel can vary based on the mark type. For `point` marks, the `size` channel maps to the pixel area of the plotting symbol, such that the diameter of the point matches the square root of the size value.
318
+
319
+ Let's augment our scatter plot by encoding population (`pop`) on the `size` channel. As a result, the chart now also includes a legend for interpreting the size values.
320
+ """)
321
+ return
322
+
323
+
324
+ @app.cell
325
+ def _(alt, data2000):
326
+ alt.Chart(data2000).mark_point().encode(
327
+ alt.X('fertility:Q'),
328
+ alt.Y('life_expect:Q'),
329
+ alt.Size('pop:Q')
330
+ )
331
+ return
332
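The area-versus-diameter relationship above can be illustrated with plain arithmetic (this is just the math behind the channel, not Altair API): because `size` maps to area, quadrupling the size value only doubles the symbol's linear extent.

```python
import math

# size is an area in pixels; the symbol's linear extent grows as sqrt(size).
for size in [100, 400, 1600]:
    print(size, math.sqrt(size))
```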
+
333
+
334
+ @app.cell(hide_code=True)
335
+ def _(mo):
336
+ mo.md(r"""
337
+ In some cases we might be unsatisfied with the default size range. To provide a customized span of sizes, set the `range` parameter of the `scale` attribute to an array indicating the smallest and largest sizes. Here we update the size encoding to range from 0 pixels (for zero values) to 1,000 pixels (for the maximum value in the scale domain):
338
+ """)
339
+ return
340
+
341
+
342
+ @app.cell
343
+ def _(alt, data2000):
344
+ alt.Chart(data2000).mark_point().encode(
345
+ alt.X('fertility:Q'),
346
+ alt.Y('life_expect:Q'),
347
+ alt.Size('pop:Q', scale=alt.Scale(range=[0,1000]))
348
+ )
349
+ return
350
+
351
+
352
+ @app.cell(hide_code=True)
353
+ def _(mo):
354
+ mo.md(r"""
355
+ ### Color and Opacity
356
+ """)
357
+ return
358
+
359
+
360
+ @app.cell(hide_code=True)
361
+ def _(mo):
362
+ mo.md(r"""
363
+ The `color` encoding channel sets a mark's color. The style of color encoding is highly dependent on the data type: nominal data will default to a multi-hued qualitative color scheme, whereas ordinal and quantitative data will use perceptually ordered color gradients.
364
+
365
+ Here, we encode the `cluster` field using the `color` channel and a nominal (`N`) data type, resulting in a distinct hue for each cluster value. Can you start to guess what the `cluster` field might indicate?
366
+ """)
367
+ return
368
+
369
+
370
+ @app.cell
371
+ def _(alt, data2000):
372
+ alt.Chart(data2000).mark_point().encode(
373
+ alt.X('fertility:Q'),
374
+ alt.Y('life_expect:Q'),
375
+ alt.Size('pop:Q', scale=alt.Scale(range=[0,1000])),
376
+ alt.Color('cluster:N')
377
+ )
378
+ return
379
+
380
+
381
+ @app.cell(hide_code=True)
382
+ def _(mo):
383
+ mo.md(r"""
384
+ If we prefer filled shapes, we can pass a `filled=True` parameter to the `mark_point` method:
385
+ """)
386
+ return
387
+
388
+
389
+ @app.cell
390
+ def _(alt, data2000):
391
+ alt.Chart(data2000).mark_point(filled=True).encode(
392
+ alt.X('fertility:Q'),
393
+ alt.Y('life_expect:Q'),
394
+ alt.Size('pop:Q', scale=alt.Scale(range=[0,1000])),
395
+ alt.Color('cluster:N')
396
+ )
397
+ return
398
+
399
+
400
+ @app.cell(hide_code=True)
401
+ def _(mo):
402
+ mo.md(r"""
403
+ By default, Altair uses a bit of transparency to help combat over-plotting. We are free to further adjust the opacity, either by passing a default value to the `mark_*` method, or using a dedicated encoding channel.
404
+
405
+ Here we demonstrate how to provide a constant value to an encoding channel instead of binding a data field:
406
+ """)
407
+ return
408
+
409
+
410
+ @app.cell
411
+ def _(alt, data2000):
412
+ alt.Chart(data2000).mark_point(filled=True).encode(
413
+ alt.X('fertility:Q'),
414
+ alt.Y('life_expect:Q'),
415
+ alt.Size('pop:Q', scale=alt.Scale(range=[0,1000])),
416
+ alt.Color('cluster:N'),
417
+ alt.OpacityValue(0.5)
418
+ )
419
+ return
420
+
421
+
422
+ @app.cell(hide_code=True)
423
+ def _(mo):
424
+ mo.md(r"""
425
+ ### Shape
426
+ """)
427
+ return
428
+
429
+
430
+ @app.cell(hide_code=True)
431
+ def _(mo):
432
+ mo.md(r"""
433
+ The `shape` encoding channel sets the geometric shape used by `point` marks. Unlike the other channels we have seen so far, the `shape` channel cannot be used by other mark types. It should only be used with nominal data, as perceptual rank-order and magnitude comparisons are not supported.
434
+
435
+ Let's encode the `cluster` field using `shape` as well as `color`. Using multiple channels for the same underlying data field is known as a *redundant encoding*. The resulting chart combines both color and shape information into a single symbol legend:
436
+ """)
437
+ return
438
+
439
+
440
+ @app.cell
441
+ def _(alt, data2000):
442
+ alt.Chart(data2000).mark_point(filled=True).encode(
443
+ alt.X('fertility:Q'),
444
+ alt.Y('life_expect:Q'),
445
+ alt.Size('pop:Q', scale=alt.Scale(range=[0,1000])),
446
+ alt.Color('cluster:N'),
447
+ alt.OpacityValue(0.5),
448
+ alt.Shape('cluster:N')
449
+ )
450
+ return
451
+
452
+
453
+ @app.cell(hide_code=True)
454
+ def _(mo):
455
+ mo.md(r"""
456
+ ### Tooltips & Ordering
457
+ """)
458
+ return
459
+
460
+
461
+ @app.cell(hide_code=True)
462
+ def _(mo):
463
+ mo.md(r"""
464
+ By this point, you might feel a bit frustrated: we've built up a chart, but we still don't know what countries the visualized points correspond to! Let's add interactive tooltips to enable exploration.
465
+
466
+ The `tooltip` encoding channel determines tooltip text to show when a user moves the mouse cursor over a mark. Let's add a tooltip encoding for the `country` field, then investigate which countries are being represented.
467
+ """)
468
+ return
469
+
470
+
471
+ @app.cell
472
+ def _(alt, data2000):
473
+ alt.Chart(data2000).mark_point(filled=True).encode(
474
+ alt.X('fertility:Q'),
475
+ alt.Y('life_expect:Q'),
476
+ alt.Size('pop:Q', scale=alt.Scale(range=[0,1000])),
477
+ alt.Color('cluster:N'),
478
+ alt.OpacityValue(0.5),
479
+ alt.Tooltip('country')
480
+ )
481
+ return
482
+
483
+
484
+ @app.cell(hide_code=True)
485
+ def _(mo):
486
+ mo.md(r"""
487
+ As you mouse around, you may notice that you cannot select some of the points. For example, the largest dark blue circle corresponds to India, which is drawn on top of a country with a smaller population, preventing the mouse from hovering over that country. To fix this problem, we can use the `order` encoding channel.
488
+
489
+ The `order` encoding channel determines the order of data points, affecting both the order in which they are drawn and, for `line` and `area` marks, the order in which they are connected to one another.
490
+
491
+ Let's order the values in descending rank order by the population (`pop`), ensuring that smaller circles are drawn later than larger circles:
492
+ """)
493
+ return
494
+
495
+
496
+ @app.cell
497
+ def _(alt, data2000):
498
+ alt.Chart(data2000).mark_point(filled=True).encode(
499
+ alt.X('fertility:Q'),
500
+ alt.Y('life_expect:Q'),
501
+ alt.Size('pop:Q', scale=alt.Scale(range=[0,1000])),
502
+ alt.Color('cluster:N'),
503
+ alt.OpacityValue(0.5),
504
+ alt.Tooltip('country:N'),
505
+ alt.Order('pop:Q', sort='descending')
506
+ )
507
+ return
508
+
509
+
510
+ @app.cell(hide_code=True)
511
+ def _(mo):
512
+ mo.md(r"""
513
+ Now we can identify the smaller country being obscured by India: it's Bangladesh!
514
+
515
+ We can also now figure out what the `cluster` field represents. Mouse over the various colored points to formulate your own explanation.
516
+ """)
517
+ return
518
+
519
+
520
+ @app.cell(hide_code=True)
521
+ def _(mo):
522
+ mo.md(r"""
523
+ At this point we've added tooltips that show only a single property of the underlying data record. To show multiple values, we can provide the `tooltip` channel an array of encodings, one for each field we want to include:
524
+ """)
525
+ return
526
+
527
+
528
+ @app.cell
529
+ def _(alt, data2000):
530
+ alt.Chart(data2000).mark_point(filled=True).encode(
531
+ alt.X('fertility:Q'),
532
+ alt.Y('life_expect:Q'),
533
+ alt.Size('pop:Q', scale=alt.Scale(range=[0,1000])),
534
+ alt.Color('cluster:N'),
535
+ alt.OpacityValue(0.5),
536
+ alt.Order('pop:Q', sort='descending'),
537
+ tooltip = [
538
+ alt.Tooltip('country:N'),
539
+ alt.Tooltip('fertility:Q'),
540
+ alt.Tooltip('life_expect:Q')
541
+ ]
542
+ )
543
+ return
544
+
545
+
546
+ @app.cell(hide_code=True)
547
+ def _(mo):
548
+ mo.md(r"""
549
+ Now we can see multiple data fields upon mouse over!
550
+ """)
551
+ return
552
+
553
+
554
+ @app.cell(hide_code=True)
555
+ def _(mo):
556
+ mo.md(r"""
557
+ ### Column and Row Facets
558
+ """)
559
+ return
560
+
561
+
562
+ @app.cell(hide_code=True)
563
+ def _(mo):
564
+ mo.md(r"""
565
+ Spatial position is one of the most powerful and flexible channels for visual encoding, but what can we do if we already have assigned fields to the `x` and `y` channels? One valuable technique is to create a *trellis plot*, consisting of sub-plots that show a subset of the data. A trellis plot is one example of the more general technique of presenting data using [small multiples](https://en.wikipedia.org/wiki/Small_multiple) of views.
566
+
567
+ The `column` and `row` encoding channels generate either a horizontal (columns) or vertical (rows) set of sub-plots, in which the data is partitioned according to the provided data field.
568
+
569
+ Here is a trellis plot that divides the data into one column per `cluster` value:
570
+ """)
571
+ return
572
+
573
+
574
+ @app.cell
575
+ def _(alt, data2000):
576
+ alt.Chart(data2000).mark_point(filled=True).encode(
577
+ alt.X('fertility:Q'),
578
+ alt.Y('life_expect:Q'),
579
+ alt.Size('pop:Q', scale=alt.Scale(range=[0,1000])),
580
+ alt.Color('cluster:N'),
581
+ alt.OpacityValue(0.5),
582
+ alt.Tooltip('country:N'),
583
+ alt.Order('pop:Q', sort='descending'),
584
+ alt.Column('cluster:N')
585
+ )
586
+ return
587
+
588
+
589
+ @app.cell(hide_code=True)
590
+ def _(mo):
591
+ mo.md(r"""
592
+ The plot above does not fit on screen, making it difficult to compare all the sub-plots to each other! We can set the default `width` and `height` properties to create a smaller set of multiples. Also, as the column headers already label the `cluster` values, let's remove our `color` legend by setting it to `None`. To make better use of space we can also orient our `size` legend to the `'bottom'` of the chart.
593
+ """)
594
+ return
595
+
596
+
597
+ @app.cell
598
+ def _(alt, data2000):
599
+ alt.Chart(data2000).mark_point(filled=True).encode(
600
+ alt.X('fertility:Q'),
601
+ alt.Y('life_expect:Q'),
602
+ alt.Size('pop:Q', scale=alt.Scale(range=[0,1000]),
603
+ legend=alt.Legend(orient='bottom', titleOrient='left')),
604
+ alt.Color('cluster:N', legend=None),
605
+ alt.OpacityValue(0.5),
606
+ alt.Tooltip('country:N'),
607
+ alt.Order('pop:Q', sort='descending'),
608
+ alt.Column('cluster:N')
609
+ ).properties(width=135, height=135)
610
+ return
611
+
612
+
613
+ @app.cell(hide_code=True)
614
+ def _(mo):
615
+ mo.md(r"""
616
+     Under the hood, the `column` and `row` encodings are translated into a new specification that uses the `facet` view composition operator. We will revisit faceting in greater depth later on!
617
+
618
+ In the meantime, _can you rewrite the chart above to facet into rows instead of columns?_
619
+ """)
620
+ return
621
+
622
+
623
+ @app.cell(hide_code=True)
624
+ def _(mo):
625
+ mo.md(r"""
626
+ ### A Peek Ahead: Interactive Filtering
627
+
628
+     In later modules, we'll dive into interaction techniques for data exploration. Here is a sneak peek: binding a range slider to the `year` field to enable interactive scrubbing through each year of data. Don't worry if the code below is a bit confusing at this point, as we will cover interaction in detail later.
629
+
630
+ _Drag the slider back and forth to see how the data values change over time!_
631
+ """)
632
+ return
633
+
634
+
635
+ @app.cell
636
+ def _(alt, data):
637
+ select_year = alt.selection_point(
638
+ name='select', fields=['year'], value=[{'year': 1955}],
639
+ bind=alt.binding_range(min=1955, max=2005, step=5)
640
+ )
641
+
642
+ alt.Chart(data).mark_point(filled=True).encode(
643
+ alt.X('fertility:Q', scale=alt.Scale(domain=[0,9])),
644
+ alt.Y('life_expect:Q', scale=alt.Scale(domain=[0,90])),
645
+ alt.Size('pop:Q', scale=alt.Scale(domain=[0, 1200000000], range=[0,1000])),
646
+ alt.Color('cluster:N', legend=None),
647
+ alt.OpacityValue(0.5),
648
+ alt.Tooltip('country:N'),
649
+ alt.Order('pop:Q', sort='descending')
650
+ ).add_params(select_year).transform_filter(select_year)
651
+ return
652
+
653
+
654
+ @app.cell(hide_code=True)
655
+ def _(mo):
656
+ mo.md(r"""
657
+ ## Graphical Marks
658
+
659
+ Our exploration of encoding channels above exclusively uses `point` marks to visualize the data. However, the `point` mark type is only one of the many geometric shapes that can be used to visually represent data. Altair includes a number of built-in mark types, including:
660
+
661
+ - `mark_area()` - Filled areas defined by a top-line and a baseline.
662
+ - `mark_bar()` - Rectangular bars.
663
+ - `mark_circle()` - Scatter plot points as filled circles.
664
+ - `mark_line()` - Connected line segments.
665
+ - `mark_point()` - Scatter plot points with configurable shapes.
666
+ - `mark_rect()` - Filled rectangles, useful for heatmaps.
667
+ - `mark_rule()` - Vertical or horizontal lines spanning the axis.
668
+ - `mark_square()` - Scatter plot points as filled squares.
669
+ - `mark_text()` - Scatter plot points represented by text.
670
+ - `mark_tick()` - Vertical or horizontal tick marks.
671
+
672
+ For a complete list, and links to examples, see the [Altair marks documentation](https://altair-viz.github.io/user_guide/marks/index.html). Next, we will step through a number of the most commonly used mark types for statistical graphics.
673
+ """)
674
+ return
675
+
676
+
677
+ @app.cell(hide_code=True)
678
+ def _(mo):
679
+ mo.md(r"""
680
+ ### Point Marks
681
+
682
+ The `point` mark type conveys specific points, as in *scatter plots* and *dot plots*. In addition to `x` and `y` encoding channels (to specify 2D point positions), point marks can use `color`, `size`, and `shape` encodings to convey additional data fields.
683
+
684
+ Below is a dot plot of `fertility`, with the `cluster` field redundantly encoded using both the `y` and `shape` channels.
685
+ """)
686
+ return
687
+
688
+
689
+ @app.cell
690
+ def _(alt, data2000):
691
+ alt.Chart(data2000).mark_point().encode(
692
+ alt.X('fertility:Q'),
693
+ alt.Y('cluster:N'),
694
+ alt.Shape('cluster:N')
695
+ )
696
+ return
697
+
698
+
699
+ @app.cell(hide_code=True)
700
+ def _(mo):
701
+ mo.md(r"""
702
+ In addition to encoding channels, marks can be stylized by providing values to the `mark_*()` methods.
703
+
704
+     For example, point marks are drawn with stroked outlines by default, but can be drawn as `filled` shapes instead. Similarly, you can provide a default `size` to control the total pixel area of the point mark.
705
+ """)
706
+ return
707
+
708
+
709
+ @app.cell
710
+ def _(alt, data2000):
711
+ alt.Chart(data2000).mark_point(filled=True, size=100).encode(
712
+ alt.X('fertility:Q'),
713
+ alt.Y('cluster:N'),
714
+ alt.Shape('cluster:N')
715
+ )
716
+ return
717
+
718
+
719
+ @app.cell(hide_code=True)
720
+ def _(mo):
721
+ mo.md(r"""
722
+ ### Circle Marks
723
+
724
+ The `circle` mark type is a convenient shorthand for `point` marks drawn as filled circles.
725
+ """)
726
+ return
727
+
728
+
729
+ @app.cell
730
+ def _(alt, data2000):
731
+ alt.Chart(data2000).mark_circle(size=100).encode(
732
+ alt.X('fertility:Q'),
733
+ alt.Y('cluster:N'),
734
+ alt.Shape('cluster:N')
735
+ )
736
+ return
737
+
738
+
739
+ @app.cell(hide_code=True)
740
+ def _(mo):
741
+ mo.md(r"""
742
+ ### Square Marks
743
+
744
+ The `square` mark type is a convenient shorthand for `point` marks drawn as filled squares.
745
+ """)
746
+ return
747
+
748
+
749
+ @app.cell
750
+ def _(alt, data2000):
751
+ alt.Chart(data2000).mark_square(size=100).encode(
752
+ alt.X('fertility:Q'),
753
+ alt.Y('cluster:N'),
754
+ alt.Shape('cluster:N')
755
+ )
756
+ return
757
+
758
+
759
+ @app.cell(hide_code=True)
760
+ def _(mo):
761
+ mo.md(r"""
762
+ ### Tick Marks
763
+
764
+ The `tick` mark type conveys a data point using a short line segment or "tick". These are particularly useful for comparing values along a single dimension with minimal overlap. A *dot plot* drawn with tick marks is sometimes referred to as a *strip plot*.
765
+ """)
766
+ return
767
+
768
+
769
+ @app.cell
770
+ def _(alt, data2000):
771
+ alt.Chart(data2000).mark_tick().encode(
772
+ alt.X('fertility:Q'),
773
+ alt.Y('cluster:N'),
774
+ alt.Shape('cluster:N')
775
+ )
776
+ return
777
+
778
+
779
+ @app.cell(hide_code=True)
780
+ def _(mo):
781
+ mo.md(r"""
782
+ ### Bar Marks
783
+
784
+     The `bar` mark type draws a rectangle with a position, width, and height.
785
+
786
+     The plot below is a simple bar chart of the population (`pop`) of each country.
787
+ """)
788
+ return
789
+
790
+
791
+ @app.cell
792
+ def _(alt, data2000):
793
+ alt.Chart(data2000).mark_bar().encode(
794
+ alt.X('country:N'),
795
+ alt.Y('pop:Q')
796
+ )
797
+ return
798
+
799
+
800
+ @app.cell(hide_code=True)
801
+ def _(mo):
802
+ mo.md(r"""
803
+ The bar width is set to a default size. We will discuss how to adjust the bar width later in this notebook. (A subsequent notebook will take a closer look at configuring axes, scales, and legends.)
804
+
805
+ Bars can also be stacked. Let's change the `x` encoding to use the `cluster` field, and encode `country` using the `color` channel. We'll also disable the legend (which would be very long with colors for all countries!) and use tooltips for the country name.
806
+ """)
807
+ return
808
+
809
+
810
+ @app.cell
811
+ def _(alt, data2000):
812
+ alt.Chart(data2000).mark_bar().encode(
813
+ alt.X('cluster:N'),
814
+ alt.Y('pop:Q'),
815
+ alt.Color('country:N', legend=None),
816
+ alt.Tooltip('country:N')
817
+ )
818
+ return
819
+
820
+
821
+ @app.cell(hide_code=True)
822
+ def _(mo):
823
+ mo.md(r"""
824
+ In the chart above, the use of the `color` encoding channel causes Altair / Vega-Lite to automatically stack the bar marks. Otherwise, bars would be drawn on top of each other! Try adding the parameter `stack=None` to the `y` encoding channel to see what happens if we don't apply stacking...
825
+ """)
826
+ return
827
+
828
+
829
+ @app.cell(hide_code=True)
830
+ def _(mo):
831
+ mo.md(r"""
832
+ The examples above create bar charts from a zero-baseline, and the `y` channel only encodes the non-zero value (or height) of the bar. However, the bar mark also allows you to specify starting and ending points to convey ranges.
833
+
834
+ The chart below uses the `x` (starting point) and `x2` (ending point) channels to show the range of life expectancies within each regional cluster. Below we use the `min` and `max` aggregation functions to determine the end points of the range; we will discuss aggregation in greater detail in the next notebook!
835
+
836
+ Alternatively, you can use `x` and `width` to provide a starting point plus offset, such that `x2 = x + width`.
837
+ """)
838
+ return
839
+
840
+
841
+ @app.cell
842
+ def _(alt, data2000):
843
+ alt.Chart(data2000).mark_bar().encode(
844
+ alt.X('min(life_expect):Q'),
845
+ alt.X2('max(life_expect):Q'),
846
+ alt.Y('cluster:N')
847
+ )
848
+ return
849
+
850
+
851
+ @app.cell(hide_code=True)
852
+ def _(mo):
853
+ mo.md(r"""
854
+ ### Line Marks
855
+
856
+ The `line` mark type connects plotted points with line segments, for example so that a line's slope conveys information about the rate of change.
857
+
858
+ Let's plot a line chart of fertility per country over the years, using the full, unfiltered global development data frame. We'll again hide the legend and use tooltips instead.
859
+ """)
860
+ return
861
+
862
+
863
+ @app.cell
864
+ def _(alt, data):
865
+ alt.Chart(data).mark_line().encode(
866
+ alt.X('year:O'),
867
+ alt.Y('fertility:Q'),
868
+ alt.Color('country:N', legend=None),
869
+ alt.Tooltip('country:N')
870
+ ).properties(
871
+ width=400
872
+ )
873
+ return
874
+
875
+
876
+ @app.cell(hide_code=True)
877
+ def _(mo):
878
+ mo.md(r"""
879
+     We can see interesting variations per country, but an overall trend toward fewer children per family over time. Also note that we set a custom width of 400 pixels. _Try changing (or removing) the width and see what happens!_
880
+
881
+ Let's change some of the default mark parameters to customize the plot. We can set the `strokeWidth` to determine the thickness of the lines and the `opacity` to add some transparency. By default, the `line` mark uses straight line segments to connect data points. In some cases we might want to smooth the lines. We can adjust the interpolation used to connect data points by setting the `interpolate` mark parameter. Let's use `'monotone'` interpolation to provide smooth lines that are also guaranteed not to inadvertently generate "false" minimum or maximum values as a result of the interpolation.
882
+ """)
883
+ return
884
+
885
+
886
+ @app.cell
887
+ def _(alt, data):
888
+ alt.Chart(data).mark_line(
889
+ strokeWidth=3,
890
+ opacity=0.5,
891
+ interpolate='monotone'
892
+ ).encode(
893
+ alt.X('year:O'),
894
+ alt.Y('fertility:Q'),
895
+ alt.Color('country:N', legend=None),
896
+ alt.Tooltip('country:N')
897
+ ).properties(
898
+ width=400
899
+ )
900
+ return
901
+
902
+
903
+ @app.cell(hide_code=True)
904
+ def _(mo):
905
+ mo.md(r"""
906
+ The `line` mark can also be used to create *slope graphs*, charts that highlight the change in value between two comparison points using line slopes.
907
+
908
+ Below let's create a slope graph comparing the populations of each country at minimum and maximum years in our full dataset: 1955 and 2005. We first create a new Pandas data frame filtered to those years, then use Altair to create the slope graph.
909
+
910
+     By default, Altair places the years close together. To better space out the years along the x-axis, we can indicate the size (in pixels) of discrete steps along the width of our chart. Try adjusting the `step` value in the code below and see how the chart changes in response.
911
+ """)
912
+ return
913
+
914
+
915
+ @app.cell
916
+ def _(alt, data):
917
+ dataTime = data.loc[(data['year'] == 1955) | (data['year'] == 2005)]
918
+
919
+ alt.Chart(dataTime).mark_line(opacity=0.5).encode(
920
+ alt.X('year:O'),
921
+ alt.Y('pop:Q'),
922
+ alt.Color('country:N', legend=None),
923
+ alt.Tooltip('country:N')
924
+ ).properties(
925
+ width={"step": 50} # adjust the step parameter
926
+ )
927
+ return
928
+
929
+
930
+ @app.cell(hide_code=True)
931
+ def _(mo):
932
+ mo.md(r"""
933
+ ### Area Marks
934
+
935
+ The `area` mark type combines aspects of `line` and `bar` marks: it visualizes connections (slopes) among data points, but also shows a filled region, with one edge defaulting to a zero-valued baseline.
936
+ """)
937
+ return
938
+
939
+
940
+ @app.cell(hide_code=True)
941
+ def _(mo):
942
+ mo.md(r"""
943
+     The chart below is an area chart of fertility over time for just the United States:
944
+ """)
945
+ return
946
+
947
+
948
+ @app.cell
949
+ def _(alt, data):
950
+ dataUS = data.loc[data['country'] == 'United States']
951
+
952
+ alt.Chart(dataUS).mark_area().encode(
953
+ alt.X('year:O'),
954
+ alt.Y('fertility:Q')
955
+ )
956
+ return (dataUS,)
957
+
958
+
959
+ @app.cell(hide_code=True)
960
+ def _(mo):
961
+ mo.md(r"""
962
+ Similar to `line` marks, `area` marks support an `interpolate` parameter.
963
+ """)
964
+ return
965
+
966
+
967
+ @app.cell
968
+ def _(alt, dataUS):
969
+ alt.Chart(dataUS).mark_area(interpolate='monotone').encode(
970
+ alt.X('year:O'),
971
+ alt.Y('fertility:Q')
972
+ )
973
+ return
974
+
975
+
976
+ @app.cell(hide_code=True)
977
+ def _(mo):
978
+ mo.md(r"""
979
+ Similar to `bar` marks, `area` marks also support stacking. Here we create a new data frame with data for the three North American countries, then plot them using an `area` mark and a `color` encoding channel to stack by country.
980
+ """)
981
+ return
982
+
983
+
984
+ @app.cell
985
+ def _(alt, data):
986
+ dataNA = data.loc[
987
+ (data['country'] == 'United States') |
988
+ (data['country'] == 'Canada') |
989
+ (data['country'] == 'Mexico')
990
+ ]
991
+
992
+ alt.Chart(dataNA).mark_area().encode(
993
+ alt.X('year:O'),
994
+ alt.Y('pop:Q'),
995
+ alt.Color('country:N')
996
+ )
997
+ return (dataNA,)
998
+
999
+
1000
+ @app.cell(hide_code=True)
1001
+ def _(mo):
1002
+ mo.md(r"""
1003
+ By default, stacking is performed relative to a zero baseline. However, other `stack` options are available:
1004
+
1005
+ * `center` - to stack relative to a baseline in the center of the chart, creating a *streamgraph* visualization, and
1006
+ * `normalize` - to normalize the summed data at each stacking point to 100%, enabling percentage comparisons.
1007
+
1008
+     Below we adapt the chart by setting the `y` encoding `stack` attribute to `center`. What happens if you instead set it to `normalize`?
1009
+ """)
1010
+ return
1011
+
1012
+
1013
+ @app.cell
1014
+ def _(alt, dataNA):
1015
+ alt.Chart(dataNA).mark_area().encode(
1016
+ alt.X('year:O'),
1017
+ alt.Y('pop:Q', stack='center'),
1018
+ alt.Color('country:N')
1019
+ )
1020
+ return
1021
+
1022
+
1023
+ @app.cell(hide_code=True)
1024
+ def _(mo):
1025
+ mo.md(r"""
1026
+ To disable stacking altogether, set the `stack` attribute to `None`. We can also add `opacity` as a default mark parameter to ensure we see the overlapping areas!
1027
+ """)
1028
+ return
1029
+
1030
+
1031
+ @app.cell
1032
+ def _(alt, dataNA):
1033
+ alt.Chart(dataNA).mark_area(opacity=0.5).encode(
1034
+ alt.X('year:O'),
1035
+ alt.Y('pop:Q', stack=None),
1036
+ alt.Color('country:N')
1037
+ )
1038
+ return
1039
+
1040
+
1041
+ @app.cell(hide_code=True)
1042
+ def _(mo):
1043
+ mo.md(r"""
1044
+ The `area` mark type also supports data-driven baselines, with both the upper and lower series determined by data fields. As with `bar` marks, we can use the `x` and `x2` (or `y` and `y2`) channels to provide end points for the area mark.
1045
+
1046
+ The chart below visualizes the range of minimum and maximum fertility, per year, for North American countries:
1047
+ """)
1048
+ return
1049
+
1050
+
1051
+ @app.cell
1052
+ def _(alt, dataNA):
1053
+ alt.Chart(dataNA).mark_area().encode(
1054
+ alt.X('year:O'),
1055
+ alt.Y('min(fertility):Q'),
1056
+ alt.Y2('max(fertility):Q')
1057
+ ).properties(
1058
+ width={"step": 40}
1059
+ )
1060
+ return
1061
+
1062
+
1063
+ @app.cell(hide_code=True)
1064
+ def _(mo):
1065
+ mo.md(r"""
1066
+     We can see a larger range of values in 1955, from just under 4 to just under 7. By 2005, both the overall fertility values and the variability have declined, centered around 2 children per family.
1067
+ """)
1068
+ return
1069
+
1070
+
1071
+ @app.cell(hide_code=True)
1072
+ def _(mo):
1073
+ mo.md(r"""
1074
+ All the `area` mark examples above use a vertically oriented area. However, Altair and Vega-Lite support horizontal areas as well. Let's transpose the chart above, simply by swapping the `x` and `y` channels.
1075
+ """)
1076
+ return
1077
+
1078
+
1079
+ @app.cell
1080
+ def _(alt, dataNA):
1081
+ alt.Chart(dataNA).mark_area().encode(
1082
+ alt.Y('year:O'),
1083
+ alt.X('min(fertility):Q'),
1084
+ alt.X2('max(fertility):Q')
1085
+ ).properties(
1086
+         height={"step": 40}
1087
+ )
1088
+ return
1089
+
1090
+
1091
+ @app.cell(hide_code=True)
1092
+ def _(mo):
1093
+ mo.md(r"""
1094
+ ## Summary
1095
+
1096
+ We've completed our tour of data types, encoding channels, and graphical marks! You should now be well-equipped to further explore the space of encodings, mark types, and mark parameters. For a comprehensive reference &ndash; including features we've skipped over here! &ndash; see the Altair [marks](https://altair-viz.github.io/user_guide/marks/index.html) and [encoding](https://altair-viz.github.io/user_guide/encodings/index.html) documentation.
1097
+
1098
+ In the next module, we will look at the use of data transformations to create charts that summarize data or visualize new derived fields. In a later module, we'll examine how to further customize your charts by modifying scales, axes, and legends.
1099
+
1100
+ Interested in learning more about visual encoding?
1101
+ """)
1102
+ return
1103
+
1104
+
1105
+ @app.cell(hide_code=True)
1106
+ def _(mo):
1107
+ mo.md(r"""
1108
+ <img title="Bertin's Taxonomy of Visual Encoding Channels" src="https://cdn-images-1.medium.com/max/2000/1*jsb78Rr2cDy6zrE7j2IKig.png" style="max-width: 650px;"><br/>
1109
+
1110
+ <small>Bertin's taxonomy of visual encodings from <a href="https://books.google.com/books/about/Semiology_of_Graphics.html?id=X5caQwAACAAJ"><em>Sémiologie Graphique</em></a>, as adapted by <a href="https://bost.ocks.org/mike/">Mike Bostock</a>.</small>
1111
+ """)
1112
+ return
1113
+
1114
+
1115
+ @app.cell(hide_code=True)
1116
+ def _(mo):
1117
+ mo.md(r"""
1118
+ - The systematic study of marks, visual encodings, and backing data types was initiated by [Jacques Bertin](https://en.wikipedia.org/wiki/Jacques_Bertin) in his pioneering 1967 work [_Sémiologie Graphique (The Semiology of Graphics)_](https://books.google.com/books/about/Semiology_of_Graphics.html?id=X5caQwAACAAJ). The image above illustrates position, size, value (brightness), texture, color (hue), orientation, and shape channels, alongside Bertin's recommendations for the data types they support.
1119
+ - The framework of data types, marks, and channels also guides _automated_ visualization design tools, starting with [Mackinlay's APT (A Presentation Tool)](https://scholar.google.com/scholar?cluster=10191273548472217907) in 1986 and continuing in more recent systems such as [Voyager](http://idl.cs.washington.edu/papers/voyager/) and [Draco](http://idl.cs.washington.edu/papers/draco/).
1120
+     - The identification of nominal, ordinal, interval, and ratio types dates at least as far back as S. S. Stevens's 1946 article [_On the theory of scales of measurement_](https://scholar.google.com/scholar?cluster=14356809180080326415).
1121
+ """)
1122
+ return
1123
+
1124
+
1125
+ if __name__ == "__main__":
1126
+ app.run()
altair/03_data_transformation.py ADDED
@@ -0,0 +1,641 @@
1
+ # /// script
2
+ # requires-python = ">=3.11"
3
+ # dependencies = [
4
+ # "altair==6.0.0",
5
+ # "marimo",
6
+ # "pandas==3.0.1",
7
+ # ]
8
+ # ///
9
+
10
+ import marimo
11
+
12
+ __generated_with = "0.20.4"
13
+ app = marimo.App()
14
+
15
+
16
+ @app.cell
17
+ def _():
18
+ import marimo as mo
19
+
20
+ return (mo,)
21
+
22
+
23
+ @app.cell(hide_code=True)
24
+ def _(mo):
25
+ mo.md(r"""
26
+ # Data Transformation
27
+
28
+ In previous notebooks we learned how to use marks and visual encodings to represent individual data records. Here we will explore methods for *transforming* data, including the use of aggregates to summarize multiple records. Data transformation is an integral part of visualization: choosing the variables to show and their level of detail is just as important as choosing appropriate visual encodings. After all, it doesn't matter how well chosen your visual encodings are if you are showing the wrong information!
29
+
30
+ As you work through this module, we recommend that you open the [Altair Data Transformations documentation](https://altair-viz.github.io/user_guide/transform/) in another tab. It will be a useful resource if at any point you'd like more details or want to see what other transformations are available.
31
+
32
+ _This notebook is part of the [data visualization curriculum](https://github.com/uwdata/visualization-curriculum)._
33
+ """)
34
+ return
35
+
36
+
37
+ @app.cell
38
+ def _():
39
+ import pandas as pd
40
+ import altair as alt
41
+
42
+ return alt, pd
43
+
44
+
45
+ @app.cell(hide_code=True)
46
+ def _(mo):
47
+ mo.md(r"""
48
+ ## The Movies Dataset
49
+ """)
50
+ return
51
+
52
+
53
+ @app.cell(hide_code=True)
54
+ def _(mo):
55
+ mo.md(r"""
56
+ We will be working with a table of data about motion pictures, taken from the [vega-datasets](https://vega.github.io/vega-datasets/) collection. The data includes variables such as the film name, director, genre, release date, ratings, and gross revenues. However, _be careful when working with this data_: the films are from unevenly sampled years, using data combined from multiple sources. If you dig in you will find issues with missing values and even some subtle errors! Nevertheless, the data should prove interesting to explore...
57
+
58
+ Let's retrieve the URL for the JSON data file from the vega_datasets package, and then read the data into a Pandas data frame so that we can inspect its contents.
59
+ """)
60
+ return
61
+
62
+
63
+ @app.cell
64
+ def _(pd):
65
+ movies_url = 'https://cdn.jsdelivr.net/npm/vega-datasets@1/data/movies.json'
66
+ movies = pd.read_json(movies_url)
67
+ return movies, movies_url
68
+
69
+
70
+ @app.cell(hide_code=True)
71
+ def _(mo):
72
+ mo.md(r"""
73
+ How many rows (records) and columns (fields) are in the movies dataset?
74
+ """)
75
+ return
76
+
77
+
78
+ @app.cell
79
+ def _(movies):
80
+ movies.shape
81
+ return
82
+
83
+
84
+ @app.cell(hide_code=True)
85
+ def _(mo):
86
+ mo.md(r"""
87
+ Now let's peek at the first 5 rows of the table to get a sense of the fields and data types...
88
+ """)
89
+ return
90
+
91
+
92
+ @app.cell
93
+ def _(movies):
94
+ movies.head(5)
95
+ return
96
+
97
+
98
+ @app.cell(hide_code=True)
99
+ def _(mo):
100
+ mo.md(r"""
101
+ ## Histograms
102
+
103
+ We'll start our transformation tour by _binning_ data into discrete groups and _counting_ records to summarize those groups. The resulting plots are known as [_histograms_](https://en.wikipedia.org/wiki/Histogram).
104
+
105
+ Let's first look at unaggregated data: a scatter plot showing movie ratings from Rotten Tomatoes versus ratings from IMDB users. We'll provide data to Altair by passing the movies data URL to the `Chart` method. (We could also pass the Pandas data frame directly to get the same result.) We can then encode the Rotten Tomatoes and IMDB ratings fields using the `x` and `y` channels:
106
+ """)
107
+ return
108
+
109
+
110
+ @app.cell
111
+ def _(alt, movies_url):
112
+ alt.Chart(movies_url).mark_circle().encode(
113
+ alt.X('Rotten_Tomatoes_Rating:Q'),
114
+ alt.Y('IMDB_Rating:Q')
115
+ )
116
+ return
117
+
118
+
119
+ @app.cell(hide_code=True)
120
+ def _(mo):
121
+ mo.md(r"""
122
+ To summarize this data, we can *bin* a data field to group numeric values into discrete groups. Here we bin along the x-axis by adding `bin=True` to the `x` encoding channel. The result is a set of ten bins of equal step size, each corresponding to a span of ten ratings points.
123
+ """)
124
+ return
125
+
126
+
127
+ @app.cell
128
+ def _(alt, movies_url):
129
+ alt.Chart(movies_url).mark_circle().encode(
130
+ alt.X('Rotten_Tomatoes_Rating:Q', bin=True),
131
+ alt.Y('IMDB_Rating:Q')
132
+ )
133
+ return
134
+
135
+
136
+ @app.cell(hide_code=True)
137
+ def _(mo):
138
+ mo.md(r"""
139
+ Setting `bin=True` uses default binning settings, but we can exercise more control if desired. Let's instead set the maximum bin count (`maxbins`) to 20, which has the effect of doubling the number of bins. Now each bin corresponds to a span of five ratings points.
140
+ """)
141
+ return
142
+
143
+
144
+ @app.cell
145
+ def _(alt, movies_url):
146
+ alt.Chart(movies_url).mark_circle().encode(
147
+ alt.X('Rotten_Tomatoes_Rating:Q', bin=alt.BinParams(maxbins=20)),
148
+ alt.Y('IMDB_Rating:Q')
149
+ )
150
+ return
151
+
152
+
153
+ @app.cell(hide_code=True)
154
+ def _(mo):
155
+ mo.md(r"""
156
+ With the data binned, let's now summarize the distribution of Rotten Tomatoes ratings. We will drop the IMDB ratings for now and instead use the `y` encoding channel to show an aggregate `count` of records, so that the vertical position of each point indicates the number of movies per Rotten Tomatoes rating bin.
157
+
158
+ As the `count` aggregate counts the number of total records in each bin regardless of the field values, we do not need to include a field name in the `y` encoding.
159
+ """)
160
+ return
161
+
162
+
163
+ @app.cell
164
+ def _(alt, movies_url):
165
+ alt.Chart(movies_url).mark_circle().encode(
166
+ alt.X('Rotten_Tomatoes_Rating:Q', bin=alt.BinParams(maxbins=20)),
167
+ alt.Y('count()')
168
+ )
169
+ return
170
+
171
+
172
+ @app.cell(hide_code=True)
173
+ def _(mo):
174
+ mo.md(r"""
175
+ To arrive at a standard histogram, let's change the mark type from `circle` to `bar`:
176
+ """)
177
+ return
178
+
179
+
180
+ @app.cell
181
+ def _(alt, movies_url):
182
+ alt.Chart(movies_url).mark_bar().encode(
183
+ alt.X('Rotten_Tomatoes_Rating:Q', bin=alt.BinParams(maxbins=20)),
184
+ alt.Y('count()')
185
+ )
186
+ return
187
+
188
+
189
+ @app.cell(hide_code=True)
190
+ def _(mo):
191
+ mo.md(r"""
192
+ _We can now examine the distribution of ratings more clearly: we can see fewer movies on the negative end, and a bit more movies on the high end, but a generally uniform distribution overall. Rotten Tomatoes ratings are determined by taking "thumbs up" and "thumbs down" judgments from film critics and calculating the percentage of positive reviews. It appears this approach does a good job of utilizing the full range of rating values._
193
+
194
+ Similarly, we can create a histogram for IMDB ratings by changing the field in the `x` encoding channel:
195
+ """)
196
+ return
197
+
198
+
199
+ @app.cell
200
+ def _(alt, movies_url):
201
+ alt.Chart(movies_url).mark_bar().encode(
202
+ alt.X('IMDB_Rating:Q', bin=alt.BinParams(maxbins=20)),
203
+ alt.Y('count()')
204
+ )
205
+ return
206
+
207
+
208
+ @app.cell(hide_code=True)
209
+ def _(mo):
210
+ mo.md(r"""
211
+ _In contrast to the more uniform distribution we saw before, IMDB ratings exhibit a bell-shaped (though [negatively skewed](https://en.wikipedia.org/wiki/Skewness)) distribution. IMDB ratings are formed by averaging scores (ranging from 1 to 10) provided by the site's users. We can see that this form of measurement leads to a different shape than the Rotten Tomatoes ratings. We can also see that the mode of the distribution is between 6.5 and 7: people generally enjoy watching movies, potentially explaining the positive bias!_
212
+
213
+ Now let's turn back to our scatter plot of Rotten Tomatoes and IMDB ratings. Here's what happens if we bin *both* axes of our original plot.
214
+ """)
215
+ return
216
+
217
+
218
+ @app.cell
219
+ def _(alt, movies_url):
220
+ alt.Chart(movies_url).mark_circle().encode(
221
+ alt.X('Rotten_Tomatoes_Rating:Q', bin=alt.BinParams(maxbins=20)),
222
+ alt.Y('IMDB_Rating:Q', bin=alt.BinParams(maxbins=20)),
223
+ )
224
+ return
225
+
226
+
227
+ @app.cell(hide_code=True)
228
+ def _(mo):
229
+ mo.md(r"""
230
+ Detail is lost due to *overplotting*, with many points drawn directly on top of each other.
231
+
232
+ To form a two-dimensional histogram we can add a `count` aggregate as before. As both the `x` and `y` encoding channels are already taken, we must use a different encoding channel to convey the counts. Here is the result of using circular area by adding a *size* encoding channel.
233
+ """)
234
+ return
235
+
236
+
237
+ @app.cell
238
+ def _(alt, movies_url):
239
+ alt.Chart(movies_url).mark_circle().encode(
240
+ alt.X('Rotten_Tomatoes_Rating:Q', bin=alt.BinParams(maxbins=20)),
241
+ alt.Y('IMDB_Rating:Q', bin=alt.BinParams(maxbins=20)),
242
+ alt.Size('count()')
243
+ )
244
+ return
245
+
246
+
247
+ @app.cell(hide_code=True)
248
+ def _(mo):
249
+ mo.md(r"""
250
+ Alternatively, we can encode counts using the `color` channel and change the mark type to `bar`. The result is a two-dimensional histogram in the form of a [*heatmap*](https://en.wikipedia.org/wiki/Heat_map).
251
+ """)
252
+ return
253
+
254
+
255
+ @app.cell
256
+ def _(alt, movies_url):
257
+ alt.Chart(movies_url).mark_bar().encode(
258
+ alt.X('Rotten_Tomatoes_Rating:Q', bin=alt.BinParams(maxbins=20)),
259
+ alt.Y('IMDB_Rating:Q', bin=alt.BinParams(maxbins=20)),
260
+ alt.Color('count()')
261
+ )
262
+ return
263
+
264
+
265
+ @app.cell(hide_code=True)
266
+ def _(mo):
267
+ mo.md(r"""
268
+ Compare the *size* and *color*-based 2D histograms above. Which encoding do you think should be preferred? Why? In which plot can you more precisely compare the magnitude of individual values? In which plot can you more accurately see the overall density of ratings?
269
+ """)
270
+ return
271
+
272
+
273
+ @app.cell(hide_code=True)
274
+ def _(mo):
275
+ mo.md(r"""
276
+ ## Aggregation
277
+
278
+ Counts are just one type of aggregate. We might also calculate summaries using measures such as the `average`, `median`, `min`, or `max`. The Altair documentation includes the [full set of available aggregation functions](https://altair-viz.github.io/user_guide/transform/aggregate.html#user-guide-aggregate-transform).
279
+
280
+ Let's look at some examples!
281
+ """)
282
+ return
283
+
284
+
285
+ @app.cell(hide_code=True)
286
+ def _(mo):
287
+ mo.md(r"""
288
+ ### Averages and Sorting
289
+
290
+ _Do different genres of films receive consistently different ratings from critics?_ As a first step towards answering this question, we might examine the [*average* (a.k.a. the *arithmetic mean*)](https://en.wikipedia.org/wiki/Arithmetic_mean) rating for each genre of movie.
291
+
292
+ Let's visualize genre along the `y` axis and plot `average` Rotten Tomatoes ratings along the `x` axis.
293
+ """)
294
+ return
295
+
296
+
297
+ @app.cell
298
+ def _(alt, movies_url):
299
+ alt.Chart(movies_url).mark_bar().encode(
300
+ alt.X('average(Rotten_Tomatoes_Rating):Q'),
301
+ alt.Y('Major_Genre:N')
302
+ )
303
+ return
304
+
305
+
306
+ @app.cell(hide_code=True)
307
+ def _(mo):
308
+ mo.md(r"""
309
+ _There does appear to be some interesting variation, but looking at the data as an alphabetical list is not very helpful for ranking critical reactions to the genres._
310
+
311
+ For a tidier picture, let's sort the genres in descending order of average rating. To do so, we will add a `sort` parameter to the `y` encoding channel, stating that we wish to sort by the *average* (`op`, the aggregate operation) Rotten Tomatoes rating (the `field`) in descending `order`.
312
+ """)
313
+ return
314
+
315
+
316
+ @app.cell
317
+ def _(alt, movies_url):
318
+ alt.Chart(movies_url).mark_bar().encode(
319
+ alt.X('average(Rotten_Tomatoes_Rating):Q'),
320
+ alt.Y('Major_Genre:N', sort=alt.EncodingSortField(
321
+ op='average', field='Rotten_Tomatoes_Rating', order='descending')
322
+ )
323
+ )
324
+ return
325
+
326
+
327
+ @app.cell(hide_code=True)
328
+ def _(mo):
329
+ mo.md(r"""
330
+ _The sorted plot suggests that critics think highly of documentaries, musicals, westerns, and dramas, but look down upon romantic comedies and horror films... and who doesn't love `null` movies!?_
331
+ """)
332
+ return
333
+
334
+
335
+ @app.cell(hide_code=True)
336
+ def _(mo):
337
+ mo.md(r"""
338
+ ### Medians and the Inter-Quartile Range
339
+
340
+ While averages are a common way to summarize data, they can sometimes mislead. For example, very large or very small values ([*outliers*](https://en.wikipedia.org/wiki/Outlier)) might skew the average. To be safe, we can compare the genres according to the [*median*](https://en.wikipedia.org/wiki/Median) ratings as well.
341
+
342
+ The median is a point that splits the data evenly, such that half of the values are less than the median and the other half are greater. The median is less sensitive to outliers and so is referred to as a [*robust* statistic](https://en.wikipedia.org/wiki/Robust_statistics). For example, arbitrarily increasing the largest rating value will not cause the median to change.
343
+
344
+ Let's update our plot to use a `median` aggregate and sort by those values:
345
+ """)
346
+ return
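The robustness claim is easy to check directly in plain Python. This is a standalone illustration using made-up rating values, separate from the notebook's cells:

```python
from statistics import mean, median

ratings = [20, 40, 60, 80, 100]
inflated = [20, 40, 60, 80, 100000]  # arbitrarily increase the largest value

# The average shifts dramatically, but the median does not move.
print(mean(ratings), mean(inflated))      # 60 vs. 20040
print(median(ratings), median(inflated))  # 60 vs. 60
```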
347
+
348
+
349
+ @app.cell
350
+ def _(alt, movies_url):
351
+ alt.Chart(movies_url).mark_bar().encode(
352
+ alt.X('median(Rotten_Tomatoes_Rating):Q'),
353
+ alt.Y('Major_Genre:N', sort=alt.EncodingSortField(
354
+ op='median', field='Rotten_Tomatoes_Rating', order='descending')
355
+ )
356
+ )
357
+ return
358
+
359
+
360
+ @app.cell(hide_code=True)
361
+ def _(mo):
362
+ mo.md(r"""
363
+ _We can see that some of the genres with similar averages have swapped places (films of unknown genre, or `null`, are now rated highest!), but the overall groups have stayed stable. Horror films continue to get little love from professional film critics._
364
+
365
+ It's a good idea to stay skeptical when viewing aggregate statistics. So far we've only looked at *point estimates*. We have not examined how ratings vary within a genre.
366
+
367
+ Let's visualize the variation among the ratings to add some nuance to our rankings. Here we will encode the [*inter-quartile range* (IQR)](https://en.wikipedia.org/wiki/Interquartile_range) for each genre. The IQR is the range in which the middle half of data values reside. A [*quartile*](https://en.wikipedia.org/wiki/Quartile) contains 25% of the data values. The inter-quartile range consists of the two middle quartiles, and so contains the middle 50%.
368
+
369
+ To visualize ranges, we can use the `x` and `x2` encoding channels to indicate the starting and ending points. We use the aggregate functions `q1` (the lower quartile boundary) and `q3` (the upper quartile boundary) to provide the inter-quartile range. (In case you are wondering, *q2* would be the median.)
370
+ """)
371
+ return
372
+
373
+
374
+ @app.cell
375
+ def _(alt, movies_url):
376
+ alt.Chart(movies_url).mark_bar().encode(
377
+ alt.X('q1(Rotten_Tomatoes_Rating):Q'),
378
+ alt.X2('q3(Rotten_Tomatoes_Rating):Q'),
379
+ alt.Y('Major_Genre:N', sort=alt.EncodingSortField(
380
+ op='median', field='Rotten_Tomatoes_Rating', order='descending')
381
+ )
382
+ )
383
+ return
384
+
385
+
386
+ @app.cell(hide_code=True)
387
+ def _(mo):
388
+ mo.md(r"""
389
+ ### Time Units
390
+
391
+ _Now let's ask a completely different question: do box office returns vary by season?_
392
+
393
+ To get an initial answer, let's plot the median U.S. gross revenue by month.
394
+
395
+ To make this chart, use the `timeUnit` transform to map release dates to the `month` of the year. The result is similar to binning, but using meaningful time intervals. Other valid time units include `year`, `quarter`, `date` (numeric day in month), `day` (day of the week), and `hours`, as well as compound units such as `yearmonth` or `hoursminutes`. See the Altair documentation for a [complete list of time units](https://altair-viz.github.io/user_guide/transform/timeunit.html#user-guide-timeunit-transform).
396
+ """)
397
+ return
398
+
399
+
400
+ @app.cell
401
+ def _(alt, movies_url):
402
+ alt.Chart(movies_url).mark_area().encode(
403
+ alt.X('month(Release_Date):T'),
404
+ alt.Y('median(US_Gross):Q')
405
+ )
406
+ return
407
+
408
+
409
+ @app.cell(hide_code=True)
410
+ def _(mo):
411
+ mo.md(r"""
412
+ _Looking at the resulting plot, median movie sales in the U.S. appear to spike around the summer blockbuster season and the end of year holiday period. Of course, people around the world (not just the U.S.) go out to the movies. Does a similar pattern arise for worldwide gross revenue?_
413
+ """)
414
+ return
415
+
416
+
417
+ @app.cell
418
+ def _(alt, movies_url):
419
+ alt.Chart(movies_url).mark_area().encode(
420
+ alt.X('month(Release_Date):T'),
421
+ alt.Y('median(Worldwide_Gross):Q')
422
+ )
423
+ return
424
+
425
+
426
+ @app.cell(hide_code=True)
427
+ def _(mo):
428
+ mo.md(r"""
429
+ _Yes!_
430
+ """)
431
+ return
432
+
433
+
434
+ @app.cell(hide_code=True)
435
+ def _(mo):
436
+ mo.md(r"""
437
+ ## Advanced Data Transformation
438
+
439
+ The examples above all use transformations (*bin*, *timeUnit*, *aggregate*, *sort*) that are defined relative to an encoding channel. However, at times you may want to apply a chain of multiple transformations prior to visualization, or use transformations that don't integrate into encoding definitions. For such cases, Altair and Vega-Lite support data transformations defined separately from encodings. These transformations are applied to the data *before* any encodings are considered.
440
+
441
+ We *could* also perform transformations using Pandas directly, and then visualize the result. However, using the built-in transforms allows our visualizations to be published more easily in other contexts; for example, exporting the Vega-Lite JSON to use in a stand-alone web interface. Let's look at the built-in transforms supported by Altair, such as `calculate`, `filter`, `aggregate`, and `window`.
442
+ """)
443
+ return
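For comparison, here is a rough sketch of how the same kind of grouped aggregation could be done in Pandas before handing the result to Altair. The tiny DataFrame here is made up for illustration and is not the real movies table:

```python
import pandas as pd

# A made-up stand-in for the movies table.
movies_df = pd.DataFrame({
    'Major_Genre': ['Drama', 'Drama', 'Horror', 'Horror'],
    'Rotten_Tomatoes_Rating': [80, 90, 30, 50],
})

# Group by genre and average the ratings, as transform_aggregate would.
summary = (
    movies_df
    .groupby('Major_Genre', as_index=False)['Rotten_Tomatoes_Rating']
    .mean()
    .rename(columns={'Rotten_Tomatoes_Rating': 'Average_Rating'})
)
print(summary)
```

The trade-off mentioned above applies: a pre-computed table like `summary` cannot be re-derived by a stand-alone Vega-Lite view, whereas the built-in transforms travel with the exported chart specification.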
444
+
445
+
446
+ @app.cell(hide_code=True)
447
+ def _(mo):
448
+ mo.md(r"""
449
+ ### Calculate
450
+
451
+ _Think back to our comparison of U.S. gross and worldwide gross. Doesn't worldwide revenue include the U.S.? (Indeed it does.) How might we get a better sense of trends outside the U.S.?_
452
+
453
+ With the `calculate` transform we can derive new fields. Here we want to subtract U.S. gross from worldwide gross. The `calculate` transform takes a [Vega expression string](https://vega.github.io/vega/docs/expressions/) to define a formula over a single record. Vega expressions use JavaScript syntax. The `datum.` prefix accesses a field value on the input record.
454
+ """)
455
+ return
456
+
457
+
458
+ @app.cell
459
+ def _(alt, movies):
460
+ alt.Chart(movies).mark_area().transform_calculate(
461
+ NonUS_Gross='datum.Worldwide_Gross - datum.US_Gross'
462
+ ).encode(
463
+ alt.X('month(Release_Date):T'),
464
+ alt.Y('median(NonUS_Gross):Q')
465
+ )
466
+ return
467
+
468
+
469
+ @app.cell(hide_code=True)
470
+ def _(mo):
471
+ mo.md(r"""
472
+ _We can see that seasonal trends hold outside the U.S., but with a more pronounced decline in the non-peak months._
473
+ """)
474
+ return
475
+
476
+
477
+ @app.cell(hide_code=True)
478
+ def _(mo):
479
+ mo.md(r"""
480
+ ### Filter
481
+
482
+ The *filter* transform creates a new table with a subset of the original data, removing rows that fail to meet a provided [*predicate*](https://en.wikipedia.org/wiki/Predicate_%28mathematical_logic%29) test. Similar to the *calculate* transform, filter predicates are expressed using the [Vega expression language](https://vega.github.io/vega/docs/expressions/).
483
+
484
+ Below we add a filter to limit our initial scatter plot of IMDB vs. Rotten Tomatoes ratings to only films in the major genre of "Romantic Comedy".
485
+ """)
486
+ return
487
+
488
+
489
+ @app.cell
490
+ def _(alt, movies_url):
491
+ alt.Chart(movies_url).mark_circle().encode(
492
+ alt.X('Rotten_Tomatoes_Rating:Q'),
493
+ alt.Y('IMDB_Rating:Q')
494
+ ).transform_filter('datum.Major_Genre == "Romantic Comedy"')
495
+ return
496
+
497
+
498
+ @app.cell(hide_code=True)
499
+ def _(mo):
500
+ mo.md(r"""
501
+ _How does the plot change if we filter to view other genres? Edit the filter expression to find out._
502
+
503
+ Now let's filter to look at films released before 1970.
504
+ """)
505
+ return
506
+
507
+
508
+ @app.cell
509
+ def _(alt, movies_url):
510
+ alt.Chart(movies_url).mark_circle().encode(
511
+ alt.X('Rotten_Tomatoes_Rating:Q'),
512
+ alt.Y('IMDB_Rating:Q')
513
+ ).transform_filter('year(datum.Release_Date) < 1970')
514
+ return
515
+
516
+
517
+ @app.cell(hide_code=True)
518
+ def _(mo):
519
+ mo.md(r"""
520
+ _They seem to score unusually high! Are older films simply better, or is there a [selection bias](https://en.wikipedia.org/wiki/Selection_bias) towards more highly-rated older films in this dataset?_
521
+ """)
522
+ return
523
+
524
+
525
+ @app.cell(hide_code=True)
526
+ def _(mo):
527
+ mo.md(r"""
528
+ ### Aggregate
529
+
530
+ We have already seen `aggregate` transforms such as `count` and `average` in the context of encoding channels. We can also specify aggregates separately, as a pre-processing step for other transforms (as in the `window` transform examples below). The output of an `aggregate` transform is a new data table with records that contain both the `groupby` fields and the computed `aggregate` measures.
531
+
532
+ Let's recreate our plot of average ratings by genre, but this time using a separate `aggregate` transform. The output table from the aggregate transform contains 13 rows, one for each genre.
533
+
534
+ To order the `y` axis we must include a required aggregate operation in our sorting instructions. Here we use the `max` operator, which works fine because there is only one output record per genre. We could similarly use the `min` operator and end up with the same plot.
535
+ """)
536
+ return
537
+
538
+
539
+ @app.cell
540
+ def _(alt, movies_url):
541
+ alt.Chart(movies_url).mark_bar().transform_aggregate(
542
+ groupby=['Major_Genre'],
543
+ Average_Rating='average(Rotten_Tomatoes_Rating)'
544
+ ).encode(
545
+ alt.X('Average_Rating:Q'),
546
+ alt.Y('Major_Genre:N', sort=alt.EncodingSortField(
547
+ op='max', field='Average_Rating', order='descending'
548
+ )
549
+ )
550
+ )
551
+ return
552
+
553
+
554
+ @app.cell(hide_code=True)
555
+ def _(mo):
556
+ mo.md(r"""
557
+ ### Window
558
+
559
+ The `window` transform performs calculations over sorted groups of data records. Window transforms are quite powerful, supporting tasks such as ranking, lead/lag analysis, cumulative totals, and running sums or averages. Values calculated by a `window` transform are written back to the input data table as new fields. Window operations include the aggregate operations we've seen earlier, as well as specialized operations such as `rank`, `row_number`, `lead`, and `lag`. The Vega-Lite documentation lists [all valid window operations](https://vega.github.io/vega-lite/docs/window.html#ops).
560
+
561
+ One use case for a `window` transform is to calculate top-k lists. Let's plot the top 20 directors in terms of total worldwide gross.
562
+
563
+ We first use a `filter` transform to remove records for which we don't know the director. Otherwise, the director `null` would dominate the list! We then apply an `aggregate` to sum up the worldwide gross for all films, grouped by director. At this point we could plot a sorted bar chart, but we'd end up with hundreds and hundreds of directors. How can we limit the display to the top 20?
564
+
565
+ The `window` transform allows us to determine the top directors by calculating their rank order. Within our `window` transform definition we can `sort` by gross and use the `rank` operation to calculate rank scores according to that sort order. We can then add a subsequent `filter` transform to limit the data to only records with a rank value less than or equal to 20.
566
+ """)
567
+ return
568
+
569
+
570
+ @app.cell
571
+ def _(alt, movies_url):
572
+ alt.Chart(movies_url).mark_bar().transform_filter(
573
+ 'datum.Director != null'
574
+ ).transform_aggregate(
575
+ Gross='sum(Worldwide_Gross)',
576
+ groupby=['Director']
577
+ ).transform_window(
578
+ Rank='rank()',
579
+ sort=[alt.SortField('Gross', order='descending')]
580
+ ).transform_filter(
581
+ 'datum.Rank <= 20'
582
+ ).encode(
583
+ alt.X('Gross:Q'),
584
+ alt.Y('Director:N', sort=alt.EncodingSortField(
585
+ op='max', field='Gross', order='descending'
586
+ ))
587
+ )
588
+ return
589
+
590
+
591
+ @app.cell(hide_code=True)
592
+ def _(mo):
593
+ mo.md(r"""
594
+ _We can see that Steven Spielberg has been quite successful in his career! However, showing sums might favor directors who have had longer careers, and so have made more movies and thus more money. What happens if we change the choice of aggregate operation? Who is the most successful director in terms of `average` or `median` gross per film? Modify the aggregate transform above!_
595
+
596
+ Earlier in this notebook we looked at histograms, which approximate the [*probability density function*](https://en.wikipedia.org/wiki/Probability_density_function) of a set of values. A complementary approach is to look at the [*cumulative distribution*](https://en.wikipedia.org/wiki/Cumulative_distribution_function). For example, think of a histogram in which each bin includes not only its own count but also the counts from all previous bins &mdash; the result is a _running total_, with the last bin containing the total number of records. A cumulative chart directly shows us, for a given reference value, how many data values are less than or equal to that reference.
597
+
598
+ As a concrete example, let's look at the cumulative distribution of films by running time (in minutes). Only a subset of records actually include running time information, so we first `filter` down to the subset of films for which we have running times. Next, we apply an `aggregate` to count the number of films per duration (implicitly using "bins" of 1 minute each). We then use a `window` transform to compute a running total of counts across bins, sorted by increasing running time.
599
+ """)
600
+ return
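The running total that the `window` transform computes is the same idea as a cumulative sum over sorted bins. A minimal illustration with made-up per-minute counts:

```python
from itertools import accumulate

# Made-up counts of films per 1-minute duration bin, sorted by duration.
counts = [3, 5, 2, 7, 1]

# Each entry is the total of all bins up to and including that one;
# the last entry is the total number of records.
cumulative = list(accumulate(counts))
print(cumulative)  # [3, 8, 10, 17, 18]
```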
601
+
602
+
603
+ @app.cell
604
+ def _(alt, movies_url):
605
+ alt.Chart(movies_url).mark_line(interpolate='step-before').transform_filter(
606
+ 'datum.Running_Time_min != null'
607
+ ).transform_aggregate(
608
+ groupby=['Running_Time_min'],
609
+ Count='count()',
610
+ ).transform_window(
611
+ Cumulative_Sum='sum(Count)',
612
+ sort=[alt.SortField('Running_Time_min', order='ascending')]
613
+ ).encode(
614
+ alt.X('Running_Time_min:Q', axis=alt.Axis(title='Duration (min)')),
615
+ alt.Y('Cumulative_Sum:Q', axis=alt.Axis(title='Cumulative Count of Films'))
616
+ )
617
+ return
618
+
619
+
620
+ @app.cell(hide_code=True)
621
+ def _(mo):
622
+ mo.md(r"""
623
+ _Examining the cumulative distribution of film lengths, we can see that films under 110 minutes make up about half of all the films for which we have running times. There is a steady accumulation of films between 90 minutes and 2 hours, after which the distribution begins to taper off. Though rare, the dataset does contain multiple films more than 3 hours long!_
624
+ """)
625
+ return
626
+
627
+
628
+ @app.cell(hide_code=True)
629
+ def _(mo):
630
+ mo.md(r"""
631
+ ## Summary
632
+
633
+ We've only scratched the surface of what data transformations can do! For more details, including all the available transformations and their parameters, see the [Altair data transformation documentation](https://altair-viz.github.io/user_guide/transform/index.html).
634
+
635
+ Sometimes you will need to perform significant data transformation to prepare your data _prior_ to using visualization tools. To engage in [_data wrangling_](https://en.wikipedia.org/wiki/Data_wrangling) right here in Python, you can use the [Pandas library](https://pandas.pydata.org/).
636
+ """)
637
+ return
638
+
639
+
640
+ if __name__ == "__main__":
641
+ app.run()
altair/04_scales_axes_legends.py ADDED
@@ -0,0 +1,840 @@
1
+ # /// script
2
+ # requires-python = ">=3.11"
3
+ # dependencies = [
4
+ # "altair==6.0.0",
5
+ # "marimo",
6
+ # "pandas==3.0.1",
7
+ # ]
8
+ # ///
9
+
10
+ import marimo
11
+
12
+ __generated_with = "0.20.4"
13
+ app = marimo.App()
14
+
15
+
16
+ @app.cell
17
+ def _():
18
+ import marimo as mo
19
+
20
+ return (mo,)
21
+
22
+
23
+ @app.cell(hide_code=True)
24
+ def _(mo):
25
+ mo.md(r"""
26
+ # Scales, Axes, and Legends
27
+
28
+ Visual encoding &ndash; mapping data to visual variables such as position, size, shape, or color &ndash; is the beating heart of data visualization. The workhorse that actually performs this mapping is the *scale*: a function that takes a data value as input (the scale *domain*) and returns a visual value, such as a pixel position or RGB color, as output (the scale *range*). Of course, a visualization is useless if no one can figure out what it conveys! In addition to graphical marks, a chart needs reference elements, or *guides*, that allow readers to decode the graphic. Guides such as *axes* (which visualize scales with spatial ranges) and *legends* (which visualize scales with color, size, or shape ranges) are the unsung heroes of effective data visualization!
29
+
30
+ In this notebook, we will explore the options Altair provides to support customized designs of scale mappings, axes, and legends, using a running example about the effectiveness of antibiotic drugs.
31
+
32
+ _This notebook is part of the [data visualization curriculum](https://github.com/uwdata/visualization-curriculum)._
33
+ """)
34
+ return
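To make the idea concrete: a scale is just a function from the data domain to the visual range. A minimal sketch of a linear scale in plain Python (an illustration of the concept, not Altair's actual implementation):

```python
def linear_scale(value, domain, output_range):
    """Map value from domain [d0, d1] linearly onto range [r0, r1]."""
    d0, d1 = domain
    r0, r1 = output_range
    t = (value - d0) / (d1 - d0)  # normalize to [0, 1]
    return r0 + t * (r1 - r0)     # interpolate into the range

# e.g. a rating of 5 on a 0-10 domain maps to pixel 100 on a 0-200 range
print(linear_scale(5, (0, 10), (0, 200)))  # 100.0
```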
35
+
36
+
37
+ @app.cell
38
+ def _():
39
+ import pandas as pd
40
+ import altair as alt
41
+
42
+ return alt, pd
43
+
44
+
45
+ @app.cell(hide_code=True)
46
+ def _(mo):
47
+ mo.md(r"""
48
+ ## Antibiotics Data
49
+ """)
50
+ return
51
+
52
+
53
+ @app.cell(hide_code=True)
54
+ def _(mo):
55
+ mo.md(r"""
56
+ After World War II, antibiotics were considered "wonder drugs", as they were an easy remedy for what had been intractable ailments. To learn which drug worked most effectively for which bacterial infection, the performance of the three most popular antibiotics on 16 bacteria was gathered.
57
+ """)
58
+ return
59
+
60
+
61
+ @app.cell(hide_code=True)
62
+ def _(mo):
63
+ mo.md(r"""
64
+ We will be using an antibiotics dataset from the [vega-datasets collection](https://github.com/vega/vega-datasets). In the examples below, we will pass the URL directly to Altair:
65
+ """)
66
+ return
67
+
68
+
69
+ @app.cell
70
+ def _():
71
+ antibiotics = 'https://cdn.jsdelivr.net/npm/vega-datasets@1/data/burtin.json'
72
+ return (antibiotics,)
73
+
74
+
75
+ @app.cell(hide_code=True)
76
+ def _(mo):
77
+ mo.md(r"""
78
+ We can first load the data with Pandas to view the dataset in its entirety and get acquainted with the available fields:
79
+ """)
80
+ return
81
+
82
+
83
+ @app.cell
84
+ def _(antibiotics, pd):
85
+ pd.read_json(antibiotics)
86
+ return
87
+
88
+
89
+ @app.cell(hide_code=True)
90
+ def _(mo):
91
+ mo.md(r"""
92
+ The numeric values in the table indicate the [minimum inhibitory concentration (MIC)](https://en.wikipedia.org/wiki/Minimum_inhibitory_concentration), a measure of the effectiveness of the antibiotic, which represents the concentration of antibiotic (in micrograms per milliliter) required to prevent growth in vitro. The reaction of the bacteria to a procedure called [Gram staining](https://en.wikipedia.org/wiki/Gram_stain) is described by the nominal field `Gram_Staining`. Bacteria that turn dark blue or violet are Gram-positive. Otherwise, they are Gram-negative.
93
+
94
+ As we examine different visualizations of this dataset, ask yourself: What might we learn about the relative effectiveness of the antibiotics? What might we learn about the bacterial species based on their antibiotic response?
95
+ """)
96
+ return
97
+
98
+
99
+ @app.cell(hide_code=True)
100
+ def _(mo):
101
+ mo.md(r"""
102
+ ## Configuring Scales and Axes
103
+ """)
104
+ return
105
+
106
+
107
+ @app.cell(hide_code=True)
108
+ def _(mo):
109
+ mo.md(r"""
110
+ ### Plotting Antibiotic Resistance: Adjusting the Scale Type
111
+
112
+ Let's start by looking at a simple dot plot of the MIC for Neomycin.
113
+ """)
114
+ return
115
+
116
+
117
+ @app.cell
118
+ def _(alt, antibiotics):
119
+ alt.Chart(antibiotics).mark_circle().encode(
120
+ alt.X('Neomycin:Q')
121
+ )
122
+ return
123
+
124
+
125
+ @app.cell(hide_code=True)
126
+ def _(mo):
127
+ mo.md(r"""
128
+ _We can see that the MIC values span orders of magnitude: most points cluster on the left, with a few large outliers to the right._
129
+
130
+ By default Altair uses a `linear` mapping between the domain values (MIC) and the range values (pixels). To get a better overview of the data, we can apply a different scale transformation.
131
+ """)
132
+ return
133
+
134
+
135
+ @app.cell(hide_code=True)
136
+ def _(mo):
137
+ mo.md(r"""
138
+ To change the scale type, we'll set the `scale` attribute, using the `alt.Scale` method and `type` parameter.
139
+
140
+ Here's the result of using a square root (`sqrt`) scale type. Distances in the pixel range now correspond to the square root of distances in the data domain.
141
+ """)
142
+ return
143
+
144
+
145
+ @app.cell
146
+ def _(alt, antibiotics):
147
+ alt.Chart(antibiotics).mark_circle().encode(
148
+ alt.X('Neomycin:Q',
149
+ scale=alt.Scale(type='sqrt'))
150
+ )
151
+ return
152
+
153
+
154
+ @app.cell(hide_code=True)
155
+ def _(mo):
156
+ mo.md(r"""
157
+ _The points on the left are now better differentiated, but we still see some heavy skew._
158
+
159
+ Let's try using a [logarithmic scale](https://en.wikipedia.org/wiki/Logarithmic_scale) (`log`) instead:
160
+ """)
161
+ return
162
+
163
+
164
+ @app.cell
165
+ def _(alt, antibiotics):
166
+ alt.Chart(antibiotics).mark_circle().encode(
167
+ alt.X('Neomycin:Q',
168
+ scale=alt.Scale(type='log'))
169
+ )
170
+ return
171
+
172
+
173
+ @app.cell(hide_code=True)
174
+ def _(mo):
175
+ mo.md(r"""
176
+ _Now the data is much more evenly distributed and we can see the very large differences in concentrations required for different bacteria._
177
+
178
+ In a standard linear scale, a visual (pixel) distance of 10 units might correspond to an *addition* of 10 units in the data domain. A logarithmic transform maps between multiplication and addition, such that `log(u) + log(v) = log(u*v)`. As a result, in a logarithmic scale, a visual distance of 10 units instead corresponds to *multiplication* by 10 units in the data domain, assuming a base 10 logarithm. The `log` scale above defaults to using the logarithm base 10, but we can adjust this by providing a `base` parameter to the scale.
179
+ """)
180
+ return
181
+
182
+
183
+ @app.cell(hide_code=True)
184
+ def _(mo):
185
+ mo.md(r"""
186
+ ### Styling an Axis
187
+
188
+ Lower dosages indicate higher effectiveness. However, some people may expect values that are "better" to be "up and to the right" within a chart. If we want to cater to this convention, we can reverse the axis to encode "effectiveness" as a reversed MIC scale.
189
+
190
+ To do this, we can set the encoding `sort` property to `'descending'`:
191
+ """)
192
+ return
193
+
194
+
195
+ @app.cell
196
+ def _(alt, antibiotics):
197
+ alt.Chart(antibiotics).mark_circle().encode(
198
+ alt.X('Neomycin:Q',
199
+ sort='descending',
200
+ scale=alt.Scale(type='log'))
201
+ )
202
+ return
203
+
204
+
205
+ @app.cell(hide_code=True)
206
+ def _(mo):
207
+ mo.md(r"""
208
+ _Unfortunately the axis is starting to get a bit confusing: we're plotting data on a logarithmic scale, in the reverse direction, and without a clear indication of what our units are!_
209
+
210
+ Let's add a more informative axis title: we'll use the `title` property of the encoding to provide the desired title text:
211
+ """)
212
+ return
213
+
214
+
215
+ @app.cell
216
+ def _(alt, antibiotics):
217
+ alt.Chart(antibiotics).mark_circle().encode(
218
+ alt.X('Neomycin:Q',
219
+ sort='descending',
220
+ scale=alt.Scale(type='log'),
221
+ title='Neomycin MIC (μg/ml, reverse log scale)')
222
+ )
223
+ return
224
+
225
+
226
+ @app.cell(hide_code=True)
227
+ def _(mo):
228
+ mo.md(r"""
229
+ Much better!
230
+
231
+ By default, Altair places the x-axis along the bottom of the chart. To change this default, we can add an `axis` attribute with `orient='top'`:
232
+ """)
233
+ return
234
+
235
+
236
+ @app.cell
237
+ def _(alt, antibiotics):
238
+ alt.Chart(antibiotics).mark_circle().encode(
239
+ alt.X('Neomycin:Q',
240
+ sort='descending',
241
+ scale=alt.Scale(type='log'),
242
+ axis=alt.Axis(orient='top'),
243
+ title='Neomycin MIC (μg/ml, reverse log scale)')
244
+ )
245
+ return
246
+
247
+
248
+ @app.cell(hide_code=True)
249
+ def _(mo):
250
+ mo.md(r"""
251
+ Similarly, the y-axis defaults to a `'left'` orientation, but can be set to `'right'`.
252
+ """)
253
+ return
254
+
255
+
256
+ @app.cell(hide_code=True)
257
+ def _(mo):
258
+ mo.md(r"""
259
+ ### Comparing Antibiotics: Adjusting Grid Lines, Tick Counts, and Sizing
260
+
261
+ _How does neomycin compare to other antibiotics, such as streptomycin and penicillin?_
262
+
263
+ To start answering this question, we can create scatter plots, adding a y-axis encoding for another antibiotic that mirrors the design of our x-axis for neomycin.
264
+ """)
265
+ return
266
+
267
+
268
+ @app.cell
269
+ def _(alt, antibiotics):
270
+ alt.Chart(antibiotics).mark_circle().encode(
271
+ alt.X('Neomycin:Q',
272
+ sort='descending',
273
+ scale=alt.Scale(type='log'),
274
+ title='Neomycin MIC (μg/ml, reverse log scale)'),
275
+ alt.Y('Streptomycin:Q',
276
+ sort='descending',
277
+ scale=alt.Scale(type='log'),
278
+ title='Streptomycin MIC (μg/ml, reverse log scale)')
279
+ )
280
+ return
281
+
282
+
283
+ @app.cell(hide_code=True)
284
+ def _(mo):
285
+ mo.md(r"""
286
+ _We can see that neomycin and streptomycin appear highly correlated, as the bacterial strains respond similarly to both antibiotics._
287
+
288
+ Let's move on and compare neomycin with penicillin:
289
+ """)
290
+ return
291
+
292
+
293
+ @app.cell
294
+ def _(alt, antibiotics):
295
+ alt.Chart(antibiotics).mark_circle().encode(
296
+ alt.X('Neomycin:Q',
297
+ sort='descending',
298
+ scale=alt.Scale(type='log'),
299
+ title='Neomycin MIC (μg/ml, reverse log scale)'),
300
+ alt.Y('Penicillin:Q',
301
+ sort='descending',
302
+ scale=alt.Scale(type='log'),
303
+ title='Penicillin MIC (μg/ml, reverse log scale)')
304
+ )
305
+ return
306
+
307
+
308
+ @app.cell(hide_code=True)
309
+ def _(mo):
310
+ mo.md(r"""
311
+ _Now we see a more differentiated response: some bacteria respond well to neomycin but not penicillin, and vice versa!_
312
+
313
+ While this plot is useful, we can make it better. The x and y axes use the same units, but have different extents (the chart width is larger than the height) and different domains (0.001 to 100 for the x-axis, and 0.001 to 1,000 for the y-axis).
314
+
315
+ Let's equalize the axes: we can add explicit `width` and `height` settings for the chart, and specify matching domains using the scale `domain` property.
316
+ """)
317
+ return
318
+
319
+
320
+ @app.cell
321
+ def _(alt, antibiotics):
322
+ alt.Chart(antibiotics).mark_circle().encode(
323
+ alt.X('Neomycin:Q',
324
+ sort='descending',
325
+ scale=alt.Scale(type='log', domain=[0.001, 1000]),
326
+ title='Neomycin MIC (μg/ml, reverse log scale)'),
327
+ alt.Y('Penicillin:Q',
328
+ sort='descending',
329
+ scale=alt.Scale(type='log', domain=[0.001, 1000]),
330
+ title='Penicillin MIC (μg/ml, reverse log scale)')
331
+ ).properties(width=250, height=250)
332
+ return
333
+
334
+
335
+ @app.cell(hide_code=True)
336
+ def _(mo):
337
+ mo.md(r"""
338
+ _The resulting plot is more balanced, and less prone to subtle misinterpretations!_
339
+
340
+ However, the grid lines are now rather dense. If we want to remove grid lines altogether, we can add `grid=False` to the `axis` attribute. But what if we instead want to reduce the number of tick marks, for example only including grid lines for each order of magnitude?
341
+
342
+ To change the number of ticks, we can specify a target `tickCount` property for an `Axis` object. The `tickCount` is treated as a *suggestion* to Altair, to be considered alongside other aspects such as using nice, human-friendly intervals. We may not get *exactly* the number of tick marks we request, but we should get something close.
343
+ """)
344
+ return
345
+
346
+
347
+ @app.cell
348
+ def _(alt, antibiotics):
349
+ alt.Chart(antibiotics).mark_circle().encode(
350
+ alt.X('Neomycin:Q',
351
+ sort='descending',
352
+ scale=alt.Scale(type='log', domain=[0.001, 1000]),
353
+ axis=alt.Axis(tickCount=5),
354
+ title='Neomycin MIC (μg/ml, reverse log scale)'),
355
+ alt.Y('Penicillin:Q',
356
+ sort='descending',
357
+ scale=alt.Scale(type='log', domain=[0.001, 1000]),
358
+ axis=alt.Axis(tickCount=5),
359
+ title='Penicillin MIC (μg/ml, reverse log scale)')
360
+ ).properties(width=250, height=250)
361
+ return
362
+
363
+
364
+ @app.cell(hide_code=True)
365
+ def _(mo):
366
+ mo.md(r"""
367
+ By setting the `tickCount` to 5, we have the desired effect.
368
+
369
+ Our scatter plot points feel a bit small. Let's change the default size by setting the `size` property of the circle mark. This size value is the *area* of the mark in pixels.
370
+ """)
371
+ return
372
+
373
+
374
+ @app.cell
375
+ def _(alt, antibiotics):
376
+ alt.Chart(antibiotics).mark_circle(size=80).encode(
377
+ alt.X('Neomycin:Q',
378
+ sort='descending',
379
+ scale=alt.Scale(type='log', domain=[0.001, 1000]),
380
+ axis=alt.Axis(tickCount=5),
381
+ title='Neomycin MIC (μg/ml, reverse log scale)'),
382
+ alt.Y('Penicillin:Q',
383
+ sort='descending',
384
+ scale=alt.Scale(type='log', domain=[0.001, 1000]),
385
+ axis=alt.Axis(tickCount=5),
386
+ title='Penicillin MIC (μg/ml, reverse log scale)'),
387
+ ).properties(width=250, height=250)
388
+ return
389
+
390
+
391
+ @app.cell(hide_code=True)
392
+ def _(mo):
393
+ mo.md(r"""
394
+ Here we've set the circle mark area to 80 square pixels. _Further adjust the value as you see fit!_
395
+ """)
396
+ return
397
+
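Because `size` is an *area*, the perceived point diameter grows with the square root of the value: quadrupling the size only doubles the diameter. A quick plain-Python sketch of the relationship (not part of the notebook itself):

```python
import math

# The circle mark "size" is an area in square pixels;
# the rendered diameter grows with its square root.
def diameter(size):
    return 2 * math.sqrt(size / math.pi)

d80 = diameter(80)    # the size used in the chart above
d320 = diameter(320)  # 4x the area -> only 2x the diameter
```

This is why large jumps in `size` are needed to produce visually obvious changes in point size.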
398
+
399
+ @app.cell(hide_code=True)
400
+ def _(mo):
401
+ mo.md(r"""
402
+ ## Configuring Color Legends
403
+ """)
404
+ return
405
+
406
+
407
+ @app.cell(hide_code=True)
408
+ def _(mo):
409
+ mo.md(r"""
410
+ ### Color by Gram Staining
411
+
412
+ _Above we saw that neomycin is more effective for some bacteria, while penicillin is more effective for others. But how can we tell which antibiotic to use if we don't know the specific species of bacteria? Gram staining serves as a diagnostic for discriminating classes of bacteria!_
413
+
414
+ Let's encode `Gram_Staining` on the `color` channel as a nominal data type:
415
+ """)
416
+ return
417
+
418
+
419
+ @app.cell
420
+ def _(alt, antibiotics):
421
+ alt.Chart(antibiotics).mark_circle(size=80).encode(
422
+ alt.X('Neomycin:Q',
423
+ sort='descending',
424
+ scale=alt.Scale(type='log', domain=[0.001, 1000]),
425
+ axis=alt.Axis(tickCount=5),
426
+ title='Neomycin MIC (μg/ml, reverse log scale)'),
427
+ alt.Y('Penicillin:Q',
428
+ sort='descending',
429
+ scale=alt.Scale(type='log', domain=[0.001, 1000]),
430
+ axis=alt.Axis(tickCount=5),
431
+ title='Penicillin MIC (μg/ml, reverse log scale)'),
432
+ alt.Color('Gram_Staining:N')
433
+ ).properties(width=250, height=250)
434
+ return
435
+
436
+
437
+ @app.cell(hide_code=True)
438
+ def _(mo):
439
+ mo.md(r"""
440
+ _We can see that Gram-positive bacteria seem most susceptible to penicillin, whereas neomycin is more effective for Gram-negative bacteria!_
441
+
442
+ The color scheme above was automatically chosen to provide perceptually-distinguishable colors for nominal (equal or not equal) comparisons. However, we might wish to customize the colors used. In this case, Gram staining results in [distinctive physical colorings: pink for Gram-negative, purple for Gram-positive](https://en.wikipedia.org/wiki/Gram_stain#/media/File:Gram_stain_01.jpg).
443
+
444
+ Let's use those colors by specifying an explicit scale mapping from the data `domain` to the color `range`:
445
+ """)
446
+ return
447
+
448
+
449
+ @app.cell
450
+ def _(alt, antibiotics):
451
+ alt.Chart(antibiotics).mark_circle(size=80).encode(
452
+ alt.X('Neomycin:Q',
453
+ sort='descending',
454
+ scale=alt.Scale(type='log', domain=[0.001, 1000]),
455
+ axis=alt.Axis(tickCount=5),
456
+ title='Neomycin MIC (μg/ml, reverse log scale)'),
457
+ alt.Y('Penicillin:Q',
458
+ sort='descending',
459
+ scale=alt.Scale(type='log', domain=[0.001, 1000]),
460
+ axis=alt.Axis(tickCount=5),
461
+ title='Penicillin MIC (μg/ml, reverse log scale)'),
462
+ alt.Color('Gram_Staining:N',
463
+ scale=alt.Scale(domain=['negative', 'positive'], range=['hotpink', 'purple'])
464
+ )
465
+ ).properties(width=250, height=250)
466
+ return
467
+
468
+
469
+ @app.cell(hide_code=True)
470
+ def _(mo):
471
+ mo.md(r"""
472
+ By default legends are placed on the right side of the chart. Similar to axes, we can change the legend orientation using the `orient` parameter:
473
+ """)
474
+ return
475
+
476
+
477
+ @app.cell
478
+ def _(alt, antibiotics):
479
+ alt.Chart(antibiotics).mark_circle(size=80).encode(
480
+ alt.X('Neomycin:Q',
481
+ sort='descending',
482
+ scale=alt.Scale(type='log', domain=[0.001, 1000]),
483
+ axis=alt.Axis(tickCount=5),
484
+ title='Neomycin MIC (μg/ml, reverse log scale)'),
485
+ alt.Y('Penicillin:Q',
486
+ sort='descending',
487
+ scale=alt.Scale(type='log', domain=[0.001, 1000]),
488
+ axis=alt.Axis(tickCount=5),
489
+ title='Penicillin MIC (μg/ml, reverse log scale)'),
490
+ alt.Color('Gram_Staining:N',
491
+ scale=alt.Scale(domain=['negative', 'positive'], range=['hotpink', 'purple']),
492
+ legend=alt.Legend(orient='left')
493
+ )
494
+ ).properties(width=250, height=250)
495
+ return
496
+
497
+
498
+ @app.cell(hide_code=True)
499
+ def _(mo):
500
+ mo.md(r"""
501
+ We can also remove a legend entirely by specifying `legend=None`:
502
+ """)
503
+ return
504
+
505
+
506
+ @app.cell
507
+ def _(alt, antibiotics):
508
+ alt.Chart(antibiotics).mark_circle(size=80).encode(
509
+ alt.X('Neomycin:Q',
510
+ sort='descending',
511
+ scale=alt.Scale(type='log', domain=[0.001, 1000]),
512
+ axis=alt.Axis(tickCount=5),
513
+ title='Neomycin MIC (μg/ml, reverse log scale)'),
514
+ alt.Y('Penicillin:Q',
515
+ sort='descending',
516
+ scale=alt.Scale(type='log', domain=[0.001, 1000]),
517
+ axis=alt.Axis(tickCount=5),
518
+ title='Penicillin MIC (μg/ml, reverse log scale)'),
519
+ alt.Color('Gram_Staining:N',
520
+ scale=alt.Scale(domain=['negative', 'positive'], range=['hotpink', 'purple']),
521
+ legend=None
522
+ )
523
+ ).properties(width=250, height=250)
524
+ return
525
+
526
+
527
+ @app.cell(hide_code=True)
528
+ def _(mo):
529
+ mo.md(r"""
530
+ ### Color by Species
531
+
532
+ _So far we've considered the effectiveness of antibiotics. Let's turn around and ask a different question: what might antibiotic response teach us about the different species of bacteria?_
533
+
534
+ To start, let's encode `Bacteria` (a nominal data field) using the `color` channel:
535
+ """)
536
+ return
537
+
538
+
539
+ @app.cell
540
+ def _(alt, antibiotics):
541
+ alt.Chart(antibiotics).mark_circle(size=80).encode(
542
+ alt.X('Neomycin:Q',
543
+ sort='descending',
544
+ scale=alt.Scale(type='log', domain=[0.001, 1000]),
545
+ axis=alt.Axis(tickCount=5),
546
+ title='Neomycin MIC (μg/ml, reverse log scale)'),
547
+ alt.Y('Penicillin:Q',
548
+ sort='descending',
549
+ scale=alt.Scale(type='log', domain=[0.001, 1000]),
550
+ axis=alt.Axis(tickCount=5),
551
+ title='Penicillin MIC (μg/ml, reverse log scale)'),
552
+ alt.Color('Bacteria:N')
553
+ ).properties(width=250, height=250)
554
+ return
555
+
556
+
557
+ @app.cell(hide_code=True)
558
+ def _(mo):
559
+ mo.md(r"""
560
+ _The result is a bit of a mess!_ There are enough unique bacteria that Altair starts repeating colors from its default 10-color palette for nominal values.
561
+
562
+ To use custom colors, we can update the color encoding `scale` property. One option is to provide explicit scale `domain` and `range` values to indicate the precise color mappings per value, as we did above for Gram staining. Another option is to use an alternative color scheme. Altair includes a variety of built-in color schemes. For a complete list, see the [Vega color scheme documentation](https://vega.github.io/vega/docs/schemes/#reference).
563
+
564
+ Let's try switching to a built-in 20-color scheme, `tableau20`, and set that using the scale `scheme` property.
565
+ """)
566
+ return
567
+
568
+
569
+ @app.cell
570
+ def _(alt, antibiotics):
571
+ alt.Chart(antibiotics).mark_circle(size=80).encode(
572
+ alt.X('Neomycin:Q',
573
+ sort='descending',
574
+ scale=alt.Scale(type='log', domain=[0.001, 1000]),
575
+ axis=alt.Axis(tickCount=5),
576
+ title='Neomycin MIC (μg/ml, reverse log scale)'),
577
+ alt.Y('Penicillin:Q',
578
+ sort='descending',
579
+ scale=alt.Scale(type='log', domain=[0.001, 1000]),
580
+ axis=alt.Axis(tickCount=5),
581
+ title='Penicillin MIC (μg/ml, reverse log scale)'),
582
+ alt.Color('Bacteria:N',
583
+ scale=alt.Scale(scheme='tableau20'))
584
+ ).properties(width=250, height=250)
585
+ return
586
+
587
+
588
+ @app.cell(hide_code=True)
589
+ def _(mo):
590
+ mo.md(r"""
591
+ _We now have a unique color for each bacteria, but the chart is still a mess. Among other issues, the encoding takes no account of bacteria that belong to the same genus. In the chart above, the two different Salmonella strains have very different hues (teal and pink), despite being biological cousins._
592
+
593
+ To try a different scheme, we can also change the data type from nominal to ordinal. The default ordinal scheme uses blue shades, ramping from light to dark:
594
+ """)
595
+ return
596
+
597
+
598
+ @app.cell
599
+ def _(alt, antibiotics):
600
+ alt.Chart(antibiotics).mark_circle(size=80).encode(
601
+ alt.X('Neomycin:Q',
602
+ sort='descending',
603
+ scale=alt.Scale(type='log', domain=[0.001, 1000]),
604
+ axis=alt.Axis(tickCount=5),
605
+ title='Neomycin MIC (μg/ml, reverse log scale)'),
606
+ alt.Y('Penicillin:Q',
607
+ sort='descending',
608
+ scale=alt.Scale(type='log', domain=[0.001, 1000]),
609
+ axis=alt.Axis(tickCount=5),
610
+ title='Penicillin MIC (μg/ml, reverse log scale)'),
611
+ alt.Color('Bacteria:O')
612
+ ).properties(width=250, height=250)
613
+ return
614
+
615
+
616
+ @app.cell(hide_code=True)
617
+ def _(mo):
618
+ mo.md(r"""
619
+ _Some of those blue shades may be hard to distinguish._
620
+
621
+ For more differentiated colors, we can experiment with alternatives to the default `blues` color scheme. The `viridis` scheme ramps through both hue and luminance:
622
+ """)
623
+ return
624
+
625
+
626
+ @app.cell
627
+ def _(alt, antibiotics):
628
+ alt.Chart(antibiotics).mark_circle(size=80).encode(
629
+ alt.X('Neomycin:Q',
630
+ sort='descending',
631
+ scale=alt.Scale(type='log', domain=[0.001, 1000]),
632
+ axis=alt.Axis(tickCount=5),
633
+ title='Neomycin MIC (μg/ml, reverse log scale)'),
634
+ alt.Y('Penicillin:Q',
635
+ sort='descending',
636
+ scale=alt.Scale(type='log', domain=[0.001, 1000]),
637
+ axis=alt.Axis(tickCount=5),
638
+ title='Penicillin MIC (μg/ml, reverse log scale)'),
639
+ alt.Color('Bacteria:O',
640
+ scale=alt.Scale(scheme='viridis'))
641
+ ).properties(width=250, height=250)
642
+ return
643
+
644
+
645
+ @app.cell(hide_code=True)
646
+ def _(mo):
647
+ mo.md(r"""
648
+ _Bacteria from the same genus now have more similar colors than before, but the chart remains confusing. There are many colors, they are hard to look up in the legend accurately, and two bacteria with similar colors may belong to different genera._
649
+ """)
650
+ return
651
+
652
+
653
+ @app.cell(hide_code=True)
654
+ def _(mo):
655
+ mo.md(r"""
656
+ ### Color by Genus
657
+
658
+ Let's try to color by genus instead of bacteria. To do so, we will add a `calculate` transform that splits up the bacteria name on space characters and takes the first word in the resulting array. We can then encode the resulting `Genus` field using the `tableau20` color scheme.
659
+
660
+ (Note that the antibiotics dataset includes a pre-calculated `Genus` field, but we will ignore it here in order to further explore Altair's data transformations.)
661
+ """)
662
+ return
663
+
664
+
665
+ @app.cell
666
+ def _(alt, antibiotics):
667
+ alt.Chart(antibiotics).mark_circle(size=80).transform_calculate(
668
+ Genus='split(datum.Bacteria, " ")[0]'
669
+ ).encode(
670
+ alt.X('Neomycin:Q',
671
+ sort='descending',
672
+ scale=alt.Scale(type='log', domain=[0.001, 1000]),
673
+ axis=alt.Axis(tickCount=5),
674
+ title='Neomycin MIC (μg/ml, reverse log scale)'),
675
+ alt.Y('Penicillin:Q',
676
+ sort='descending',
677
+ scale=alt.Scale(type='log', domain=[0.001, 1000]),
678
+ axis=alt.Axis(tickCount=5),
679
+ title='Penicillin MIC (μg/ml, reverse log scale)'),
680
+ alt.Color('Genus:N',
681
+ scale=alt.Scale(scheme='tableau20'))
682
+ ).properties(width=250, height=250)
683
+ return
684
+
685
+
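The `calculate` transform above is evaluated as a Vega expression at render time; the same first-word extraction could equally be precomputed in Python before charting. A minimal sketch with hypothetical sample names:

```python
# Python equivalent of the Vega expression: split(datum.Bacteria, " ")[0]
bacteria = ['Salmonella typhosa', 'Staphylococcus aureus', 'Streptococcus viridans']
genus = [name.split(' ')[0] for name in bacteria]
```

Doing it in the transform keeps the derivation inside the chart specification, which is handy when the raw data is loaded by URL rather than as a DataFrame.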
686
+ @app.cell(hide_code=True)
687
+ def _(mo):
688
+ mo.md(r"""
689
+ _Hmm... While the data are better segregated by genus, this cacophony of colors doesn't seem particularly useful._
690
+
691
+ _If we look at some of the previous charts carefully, we can see that only a handful of bacteria share a genus with another: Salmonella, Staphylococcus, and Streptococcus. To focus our comparison, we might add colors only for these repeated genus values._
692
+
693
+ Let's add another `calculate` transform that takes a genus name, keeps it if it is one of the repeated values, and otherwise uses the string `"Other"`.
694
+
695
+ In addition, we can add custom color encodings using explicit `domain` and `range` arrays for the color encoding `scale`.
696
+ """)
697
+ return
698
+
699
+
700
+ @app.cell
701
+ def _(alt, antibiotics):
702
+ alt.Chart(antibiotics).mark_circle(size=80).transform_calculate(
703
+ Split='split(datum.Bacteria, " ")[0]'
704
+ ).transform_calculate(
705
+ Genus='indexof(["Salmonella", "Staphylococcus", "Streptococcus"], datum.Split) >= 0 ? datum.Split : "Other"'
706
+ ).encode(
707
+ alt.X('Neomycin:Q',
708
+ sort='descending',
709
+ scale=alt.Scale(type='log', domain=[0.001, 1000]),
710
+ axis=alt.Axis(tickCount=5),
711
+ title='Neomycin MIC (μg/ml, reverse log scale)'),
712
+ alt.Y('Penicillin:Q',
713
+ sort='descending',
714
+ scale=alt.Scale(type='log', domain=[0.001, 1000]),
715
+ axis=alt.Axis(tickCount=5),
716
+ title='Penicillin MIC (μg/ml, reverse log scale)'),
717
+ alt.Color('Genus:N',
718
+ scale=alt.Scale(
719
+ domain=['Salmonella', 'Staphylococcus', 'Streptococcus', 'Other'],
720
+ range=['rgb(76,120,168)', 'rgb(84,162,75)', 'rgb(228,87,86)', 'rgb(121,112,110)']
721
+ ))
722
+ ).properties(width=250, height=250)
723
+ return
724
+
725
+
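The second `calculate` transform is just a conditional; its Python counterpart, sketched with hypothetical inputs:

```python
# Python equivalent of the Vega expression:
# indexof(["Salmonella", ...], datum.Split) >= 0 ? datum.Split : "Other"
repeated = ['Salmonella', 'Staphylococcus', 'Streptococcus']

def bucket(genus):
    return genus if genus in repeated else 'Other'

labels = [bucket(g) for g in ['Salmonella', 'Escherichia', 'Streptococcus']]
```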
726
+ @app.cell(hide_code=True)
727
+ def _(mo):
728
+ mo.md(r"""
729
+ _We now have a much more revealing plot, made possible by customizations to the axes and legend. Take a moment to examine the plot above. Notice any surprising groupings?_
730
+
731
+ _The upper-left region has a cluster of red Streptococcus bacteria, but with a grey Other bacteria alongside them. Meanwhile, towards the middle-right we see another red Streptococcus placed far away from its "cousins". Might we expect bacteria from the same genus (and thus presumably more genetically similar) to be grouped closer together?_
732
+
733
+ As it so happens, the underlying dataset actually contains errors. The dataset reflects the species designations used in the early 1950s. However, the scientific consensus has since been overturned. That gray point in the upper-left? It's now considered a Streptococcus! That red point towards the middle-right? It's no longer considered a Streptococcus!
734
+
735
+ Of course, on its own, this dataset doesn't fully justify these reclassifications. Nevertheless, the data contain valuable biological clues that went overlooked for decades! Visualization, when used by an appropriately skilled and inquisitive viewer, can be a powerful tool for discovery.
736
+
737
+ This example also reinforces an important lesson: **_always be skeptical of your data!_**
738
+ """)
739
+ return
740
+
741
+
742
+ @app.cell(hide_code=True)
743
+ def _(mo):
744
+ mo.md(r"""
745
+ ### Color by Antibiotic Response
746
+
747
+ We might also use the `color` channel to encode quantitative values. Keep in mind, though, that color is typically not as effective for conveying quantities as position or size encodings!
748
+
749
+ Here is a basic heatmap of penicillin MIC values for each bacteria. We'll use a `rect` mark and sort the bacteria by descending MIC values (from most to least resistant):
750
+ """)
751
+ return
752
+
753
+
754
+ @app.cell
755
+ def _(alt, antibiotics):
756
+ alt.Chart(antibiotics).mark_rect().encode(
757
+ alt.Y('Bacteria:N',
758
+ sort=alt.EncodingSortField(field='Penicillin', op='max', order='descending')
759
+ ),
760
+ alt.Color('Penicillin:Q')
761
+ )
762
+ return
763
+
764
+
765
+ @app.cell(hide_code=True)
766
+ def _(mo):
767
+ mo.md(r"""
768
+ We can further improve this chart by combining features we've seen thus far: a log-transformed scale, a change of axis orientation, a custom color scheme (`plasma`), tick count adjustment, and custom title text. We'll also exercise configuration options to adjust the axis title placement and legend title alignment.
769
+ """)
770
+ return
771
+
772
+
773
+ @app.cell
774
+ def _(alt, antibiotics):
775
+ alt.Chart(antibiotics).mark_rect().encode(
776
+ alt.Y('Bacteria:N',
777
+ sort=alt.EncodingSortField(field='Penicillin', op='max', order='descending'),
778
+ axis=alt.Axis(
779
+ orient='right', # orient axis on right side of chart
780
+ titleX=7, # set x-position to 7 pixels right of chart
781
+ titleY=-2, # set y-position to 2 pixels above chart
782
+ titleAlign='left', # use left-aligned text
783
+ titleAngle=0 # undo default title rotation
784
+ )
785
+ ),
786
+ alt.Color('Penicillin:Q',
787
+ scale=alt.Scale(type='log', scheme='plasma', nice=True),
788
+ legend=alt.Legend(titleOrient='right', tickCount=5),
789
+ title='Penicillin MIC (μg/ml)'
790
+ )
791
+ )
792
+ return
793
+
794
+
795
+ @app.cell(hide_code=True)
796
+ def _(mo):
797
+ mo.md(r"""
798
+ Alternatively, we can remove the axis title altogether, and use the top-level `title` property to add a title for the entire chart:
799
+ """)
800
+ return
801
+
802
+
803
+ @app.cell
804
+ def _(alt, antibiotics):
805
+ alt.Chart(antibiotics, title='Penicillin Resistance of Bacterial Strains').mark_rect().encode(
806
+ alt.Y('Bacteria:N',
807
+ sort=alt.EncodingSortField(field='Penicillin', op='max', order='descending'),
808
+ axis=alt.Axis(orient='right', title=None)
809
+ ),
810
+ alt.Color('Penicillin:Q',
811
+ scale=alt.Scale(type='log', scheme='plasma', nice=True),
812
+ legend=alt.Legend(titleOrient='right', tickCount=5),
813
+ title='Penicillin MIC (μg/ml)'
814
+ )
815
+ ).configure_title(
816
+ anchor='start', # anchor and left-align title
817
+ offset=5 # set title offset from chart
818
+ )
819
+ return
820
+
821
+
822
+ @app.cell(hide_code=True)
823
+ def _(mo):
824
+ mo.md(r"""
825
+ ## Summary
826
+
827
+ Integrating what we've learned across the notebooks so far about encodings, data transforms, and customization, you should now be prepared to make a wide variety of statistical graphics. Now you can put Altair into everyday use for exploring and communicating data!
828
+
829
+ Interested in learning more about this topic?
830
+
831
+ - Start with the [Altair Customizing Visualizations documentation](https://altair-viz.github.io/user_guide/customization.html).
832
+ - For a complementary discussion of scale mappings, see ["Introducing d3-scale"](https://medium.com/@mbostock/introducing-d3-scale-61980c51545f).
833
+ - For a more in-depth exploration of all the ways axes and legends can be styled by the underlying Vega library (which powers Altair and Vega-Lite), see ["A Guide to Guides: Axes & Legends in Vega"](https://beta.observablehq.com/@jheer/a-guide-to-guides-axes-legends-in-vega).
834
+ - For a fascinating history of the antibiotics dataset, see [Wainer &amp; Lysen's "That's Funny..."](https://www.americanscientist.org/article/thats-funny) in the _American Scientist_.
835
+ """)
836
+ return
837
+
838
+
839
+ if __name__ == "__main__":
840
+ app.run()
altair/05_view_composition.py ADDED
@@ -0,0 +1,818 @@
1
+ # /// script
2
+ # requires-python = ">=3.11"
3
+ # dependencies = [
4
+ # "altair==6.0.0",
5
+ # "marimo",
6
+ # "pandas==3.0.1",
7
+ # ]
8
+ # ///
9
+
10
+ import marimo
11
+
12
+ __generated_with = "0.20.4"
13
+ app = marimo.App()
14
+
15
+
16
+ @app.cell
17
+ def _():
18
+ import marimo as mo
19
+
20
+ return (mo,)
21
+
22
+
23
+ @app.cell(hide_code=True)
24
+ def _(mo):
25
+ mo.md(r"""
26
+ # Multi-View Composition
27
+
28
+ When visualizing a number of different data fields, we might be tempted to use as many visual encoding channels as we can: `x`, `y`, `color`, `size`, `shape`, and so on. However, as the number of encoding channels increases, a chart can rapidly become cluttered and difficult to read. An alternative to "over-loading" a single chart is to instead _compose multiple charts_ in a way that facilitates rapid comparisons.
29
+
30
+ In this notebook, we will examine a variety of operations for _multi-view composition_:
31
+
32
+ - _layer_: place compatible charts directly on top of each other,
33
+ - _facet_: partition data into multiple charts, organized in rows or columns,
34
+ - _concatenate_: position arbitrary charts within a shared layout, and
35
+ - _repeat_: take a base chart specification and apply it to multiple data fields.
36
+
37
+ We'll then look at how these operations form a _view composition algebra_, in which the operations can be combined to build a variety of complex multi-view displays.
38
+
39
+ _This notebook is part of the [data visualization curriculum](https://github.com/uwdata/visualization-curriculum)._
40
+ """)
41
+ return
42
+
43
+
44
+ @app.cell
45
+ def _():
46
+ import pandas as pd
47
+ import altair as alt
48
+
49
+ return alt, pd
50
+
51
+
52
+ @app.cell(hide_code=True)
53
+ def _(mo):
54
+ mo.md(r"""
55
+ ## Weather Data
56
+
57
+ We will be visualizing weather statistics for the U.S. cities of Seattle and New York. Let's load the dataset and peek at the first and last 10 rows:
58
+ """)
59
+ return
60
+
61
+
62
+ @app.cell
63
+ def _():
64
+ weather = 'https://cdn.jsdelivr.net/npm/vega-datasets@1/data/weather.csv'
65
+ return (weather,)
66
+
67
+
68
+ @app.cell
69
+ def _(pd, weather):
70
+ df = pd.read_csv(weather)
71
+ df.head(10)
72
+ return (df,)
73
+
74
+
75
+ @app.cell
76
+ def _(df):
77
+ df.tail(10)
78
+ return
79
+
80
+
81
+ @app.cell(hide_code=True)
82
+ def _(mo):
83
+ mo.md(r"""
84
+ We will create multi-view displays to examine weather within and across the cities.
85
+ """)
86
+ return
87
+
88
+
89
+ @app.cell(hide_code=True)
90
+ def _(mo):
91
+ mo.md(r"""
92
+ ## Layer
93
+ """)
94
+ return
95
+
96
+
97
+ @app.cell(hide_code=True)
98
+ def _(mo):
99
+ mo.md(r"""
100
+ One of the most common ways of combining multiple charts is to *layer* marks on top of each other. If the underlying scale domains are compatible, we can merge them to form _shared axes_. If either of the `x` or `y` encodings is not compatible, we might instead create a _dual-axis chart_, which overlays marks using separate scales and axes.
101
+ """)
102
+ return
103
+
104
+
105
+ @app.cell(hide_code=True)
106
+ def _(mo):
107
+ mo.md(r"""
108
+ ### Shared Axes
109
+ """)
110
+ return
111
+
112
+
113
+ @app.cell(hide_code=True)
114
+ def _(mo):
115
+ mo.md(r"""
116
+ Let's start by plotting the minimum and maximum average temperatures per month:
117
+ """)
118
+ return
119
+
120
+
121
+ @app.cell
122
+ def _(alt, weather):
123
+ alt.Chart(weather).mark_area().encode(
124
+ alt.X('month(date):T'),
125
+ alt.Y('average(temp_max):Q'),
126
+ alt.Y2('average(temp_min):Q')
127
+ )
128
+ return
129
+
130
+
131
+ @app.cell(hide_code=True)
132
+ def _(mo):
133
+ mo.md(r"""
134
+ _The plot shows us temperature ranges for each month over the entirety of our data. However, this is pretty misleading as it aggregates the measurements for both Seattle and New York!_
135
+
136
+ Let's subdivide the data by location using a color encoding, while also adjusting the mark opacity to accommodate overlapping areas:
137
+ """)
138
+ return
139
+
140
+
141
+ @app.cell
142
+ def _(alt, weather):
143
+ alt.Chart(weather).mark_area(opacity=0.3).encode(
144
+ alt.X('month(date):T'),
145
+ alt.Y('average(temp_max):Q'),
146
+ alt.Y2('average(temp_min):Q'),
147
+ alt.Color('location:N')
148
+ )
149
+ return
150
+
151
+
152
+ @app.cell(hide_code=True)
153
+ def _(mo):
154
+ mo.md(r"""
155
+ _We can see that Seattle is more temperate: warmer in the winter, and cooler in the summer._
156
+
157
+ In this case we've created a layered chart without any special features by simply subdividing the area marks by color. While the chart above shows us the temperature ranges, we might also want to emphasize the middle of the range.
158
+
159
+ Let's create a line chart showing the average temperature midpoint. We'll use a `calculate` transform to compute the midpoints between the minimum and maximum daily temperatures:
160
+ """)
161
+ return
162
+
163
+
164
+ @app.cell
165
+ def _(alt, weather):
166
+ alt.Chart(weather).mark_line().transform_calculate(
167
+ temp_mid='(+datum.temp_min + +datum.temp_max) / 2'
168
+ ).encode(
169
+ alt.X('month(date):T'),
170
+ alt.Y('average(temp_mid):Q'),
171
+ alt.Color('location:N')
172
+ )
173
+ return
174
+
175
+
176
+ @app.cell(hide_code=True)
177
+ def _(mo):
178
+ mo.md(r"""
179
+ _Aside_: note the use of `+datum.temp_min` within the calculate transform. As we are loading the data directly from a CSV file without any special parsing instructions, the temperature values may be internally represented as string values. Adding the `+` in front of the value forces it to be treated as a number.
180
+
181
+ We'd now like to combine these charts by layering the midpoint lines over the range areas. Using the syntax `chart1 + chart2`, we can specify that we want a new layered chart in which `chart1` is the first layer and `chart2` is a second layer drawn on top:
182
+ """)
183
+ return
184
+
185
+
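In Python terms, the unary `+` plays the role of an explicit numeric conversion on values that arrive from the CSV as strings. A minimal sketch with made-up temperatures:

```python
# CSV fields are strings; convert before averaging,
# just as +datum.temp_min does in the Vega expression.
temp_min, temp_max = '2.8', '10.6'
temp_mid = (float(temp_min) + float(temp_max)) / 2
```

Without the conversion, string concatenation (in the Vega expression) or a type error (in Python) would result instead of an average.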
186
+ @app.cell
187
+ def _(alt, weather):
188
+ tempMinMax = alt.Chart(weather).mark_area(opacity=0.3).encode(
189
+ alt.X('month(date):T'),
190
+ alt.Y('average(temp_max):Q'),
191
+ alt.Y2('average(temp_min):Q'),
192
+ alt.Color('location:N')
193
+ )
194
+
195
+ tempMid = alt.Chart(weather).mark_line().transform_calculate(
196
+ temp_mid='(+datum.temp_min + +datum.temp_max) / 2'
197
+ ).encode(
198
+ alt.X('month(date):T'),
199
+ alt.Y('average(temp_mid):Q'),
200
+ alt.Color('location:N')
201
+ )
202
+
203
+ tempMinMax + tempMid
204
+ return
205
+
206
+
207
+ @app.cell(hide_code=True)
208
+ def _(mo):
209
+ mo.md(r"""
210
+ _Now we have a multi-layer plot! However, the y-axis title (though informative) has become a bit long and unruly..._
211
+
212
+ Let's customize our axes to clean up the plot. If we set a custom axis title within one of the layers, it will automatically be used as a shared axis title for all the layers:
213
+ """)
214
+ return
215
+
216
+
217
+ @app.cell
218
+ def _(alt, weather):
219
+ tempMinMax_1 = alt.Chart(weather).mark_area(opacity=0.3).encode(alt.X('month(date):T', title=None, axis=alt.Axis(format='%b')), alt.Y('average(temp_max):Q', title='Avg. Temperature °C'), alt.Y2('average(temp_min):Q'), alt.Color('location:N'))
220
+ tempMid_1 = alt.Chart(weather).mark_line().transform_calculate(temp_mid='(+datum.temp_min + +datum.temp_max) / 2').encode(alt.X('month(date):T'), alt.Y('average(temp_mid):Q'), alt.Color('location:N'))
221
+ tempMinMax_1 + tempMid_1
222
+ return tempMid_1, tempMinMax_1
223
+
224
+
225
+ @app.cell(hide_code=True)
226
+ def _(mo):
227
+ mo.md(r"""
228
+ _What happens if both layers have custom axis titles? Modify the code above to find out..._
229
+
230
+ Above we used the `+` operator, a convenient shorthand for Altair's `layer` method. We can generate an identical layered chart by calling `layer` directly:
231
+ """)
232
+ return
233
+
234
+
235
+ @app.cell
236
+ def _(alt, tempMid_1, tempMinMax_1):
237
+ alt.layer(tempMinMax_1, tempMid_1)
238
+ return
239
+
240
+
241
+ @app.cell(hide_code=True)
242
+ def _(mo):
243
+ mo.md(r"""
244
+ Note that the order of inputs to a layer matters, as subsequent layers will be drawn on top of earlier layers. _Try swapping the order of the charts in the cells above. What happens? (Hint: look closely at the color of the `line` marks.)_
245
+ """)
246
+ return
247
+
248
+
249
+ @app.cell(hide_code=True)
250
+ def _(mo):
251
+ mo.md(r"""
252
+ ### Dual-Axis Charts
253
+ """)
254
+ return
255
+
256
+
257
+ @app.cell(hide_code=True)
258
+ def _(mo):
259
+ mo.md(r"""
260
+ _Seattle has a reputation as a rainy city. Is that deserved?_
261
+
262
+ Let's look at precipitation alongside temperature to learn more. First let's create a base plot the shows average monthly precipitation in Seattle:
263
+ """)
264
+ return
265
+
266
+
267
+ @app.cell
268
+ def _(alt, weather):
269
+ alt.Chart(weather).transform_filter(
270
+ 'datum.location == "Seattle"'
271
+ ).mark_line(
272
+ interpolate='monotone',
273
+ stroke='grey'
274
+ ).encode(
275
+ alt.X('month(date):T', title=None),
276
+ alt.Y('average(precipitation):Q', title='Precipitation')
277
+ )
278
+ return
279
+
280
+
281
+ @app.cell(hide_code=True)
282
+ def _(mo):
283
+ mo.md(r"""
284
+ To facilitate comparison with the temperature data, let's create a new layered chart. Here's what happens if we try to layer the charts as we did earlier:
285
+ """)
286
+ return
287
+
288
+
289
+ @app.cell
290
+ def _(alt, weather):
291
+ tempMinMax_2 = alt.Chart(weather).transform_filter('datum.location == "Seattle"').mark_area(opacity=0.3).encode(alt.X('month(date):T', title=None, axis=alt.Axis(format='%b')), alt.Y('average(temp_max):Q', title='Avg. Temperature °C'), alt.Y2('average(temp_min):Q'))
292
+ _precip = alt.Chart(weather).transform_filter('datum.location == "Seattle"').mark_line(interpolate='monotone', stroke='grey').encode(alt.X('month(date):T'), alt.Y('average(precipitation):Q', title='Precipitation'))
293
+ alt.layer(tempMinMax_2, _precip)
294
+ return
295
+
296
+
297
+ @app.cell(hide_code=True)
298
+ def _(mo):
299
+ mo.md(r"""
300
+ _The precipitation values use a much smaller range of the y-axis than the temperatures!_
301
+
302
+ By default, layered charts use a *shared domain*: the values for the x-axis or y-axis are combined across all the layers to determine a shared extent. This default behavior assumes that the layered values have the same units. However, that assumption does not hold in this example, as we are combining temperature values (degrees Celsius) with precipitation values (inches)!
303
+
304
+ If we want to use different y-axis scales, we need to specify how we want Altair to *resolve* the data across layers. In this case, we want to resolve the y-axis `scale` domains to be `independent` rather than use a `shared` domain. The `Chart` object produced by a layer operator includes a `resolve_scale` method with which we can specify the desired resolution:
305
+ """)
306
+ return
307
+
308
+
309
+ @app.cell
310
+ def _(alt, weather):
311
+ tempMinMax_3 = alt.Chart(weather).transform_filter('datum.location == "Seattle"').mark_area(opacity=0.3).encode(alt.X('month(date):T', title=None, axis=alt.Axis(format='%b')), alt.Y('average(temp_max):Q', title='Avg. Temperature °C'), alt.Y2('average(temp_min):Q'))
312
+ _precip = alt.Chart(weather).transform_filter('datum.location == "Seattle"').mark_line(interpolate='monotone', stroke='grey').encode(alt.X('month(date):T'), alt.Y('average(precipitation):Q', title='Precipitation'))
313
+ alt.layer(tempMinMax_3, _precip).resolve_scale(y='independent')
314
+ return
315
+
316
+
317
+ @app.cell(hide_code=True)
318
+ def _(mo):
319
+ mo.md(r"""
320
+ _We can now see that autumn is the rainiest season in Seattle (peaking in November), complemented by dry summers._
321
+
322
+ You may have noticed some redundancy in our plot specifications above: both use the same dataset and the same filter to look at Seattle only. If you want, you can streamline the code a bit by providing the data and filter transform to the top-level layered chart. The individual layers will then inherit the data if they don't have their own data definitions:
323
+ """)
324
+ return
325
+
326
+
327
+ @app.cell
328
+ def _(alt, weather):
329
+ tempMinMax_4 = alt.Chart().mark_area(opacity=0.3).encode(alt.X('month(date):T', title=None, axis=alt.Axis(format='%b')), alt.Y('average(temp_max):Q', title='Avg. Temperature °C'), alt.Y2('average(temp_min):Q'))
330
+ _precip = alt.Chart().mark_line(interpolate='monotone', stroke='grey').encode(alt.X('month(date):T'), alt.Y('average(precipitation):Q', title='Precipitation'))
331
+ alt.layer(tempMinMax_4, _precip, data=weather).transform_filter('datum.location == "Seattle"').resolve_scale(y='independent')
332
+ return
333
+
334
+
335
+ @app.cell(hide_code=True)
336
+ def _(mo):
337
+ mo.md(r"""
338
+ While dual-axis charts can be useful, _they are often prone to misinterpretation_, as the different units and axis scales may be incommensurate. Where feasible, consider transformations that map different data fields to shared units, for example showing [quantiles](https://en.wikipedia.org/wiki/Quantile) or relative percentage change.
339
+ """)
340
+ return
341
+
342
+
343
+ @app.cell(hide_code=True)
344
+ def _(mo):
345
+ mo.md(r"""
346
+ ## Facet
347
+ """)
348
+ return
349
+
350
+
351
+ @app.cell(hide_code=True)
352
+ def _(mo):
353
+ mo.md(r"""
354
+ *Faceting* involves subdividing a dataset into groups and creating a separate plot for each group. In earlier notebooks, we learned how to create faceted charts using the `row` and `column` encoding channels. We'll first review those channels and then show how they are instances of the more general `facet` operator.
355
+
356
+ Let's start with a basic histogram of maximum temperature values in Seattle:
357
+ """)
358
+ return
359
+
360
+
361
+ @app.cell
362
+ def _(alt, weather):
363
+ alt.Chart(weather).mark_bar().transform_filter(
364
+ 'datum.location == "Seattle"'
365
+ ).encode(
366
+ alt.X('temp_max:Q', bin=True, title='Temperature (°C)'),
367
+ alt.Y('count():Q')
368
+ )
369
+ return
370
+
371
+
372
+ @app.cell(hide_code=True)
373
+ def _(mo):
374
+ mo.md(r"""
375
+ _How does this temperature profile change based on the weather of a given day – that is, whether there was drizzle, fog, rain, snow, or sun?_
376
+
377
+ Let's use the `column` encoding channel to facet the data by weather type. We can also use `color` as a redundant encoding, using a customized color range:
378
+ """)
379
+ return
380
+
381
+
382
+ @app.cell
383
+ def _(alt, weather):
384
+ _colors = alt.Scale(domain=['drizzle', 'fog', 'rain', 'snow', 'sun'], range=['#aec7e8', '#c7c7c7', '#1f77b4', '#9467bd', '#e7ba52'])
385
+ alt.Chart(weather).mark_bar().transform_filter('datum.location == "Seattle"').encode(alt.X('temp_max:Q', bin=True, title='Temperature (°C)'), alt.Y('count():Q'), alt.Color('weather:N', scale=_colors), alt.Column('weather:N')).properties(width=150, height=150)
386
+ return
387
+
388
+
389
+ @app.cell(hide_code=True)
390
+ def _(mo):
391
+ mo.md(r"""
392
+ _Unsurprisingly, those rare snow days center on the coldest temperatures, followed by rainy and foggy days. Sunny days are warmer and, despite Seattle stereotypes, are the most plentiful. Though as any Seattleite can tell you, the drizzle occasionally comes, no matter the temperature!_
393
+ """)
394
+ return
395
+
396
+
397
+ @app.cell(hide_code=True)
398
+ def _(mo):
399
+ mo.md(r"""
400
+ In addition to `row` and `column` encoding channels *within* a chart definition, we can take a basic chart definition and apply faceting using an explicit `facet` operator.
401
+
402
+ Let's recreate the chart above, but this time using `facet`. We start with the same basic histogram definition, but remove the data source, filter transform, and column channel. We can then invoke the `facet` method, passing in the data and specifying that we should facet into columns according to the `weather` field. The `facet` method accepts both `row` and `column` arguments. The two can be used together to create a 2D grid of faceted plots.
403
+
404
+ Finally we include our filter transform, applying it to the top-level faceted chart. While we could apply the filter transform to the histogram definition as before, that is slightly less efficient. Rather than filter out "New York" values within each facet cell, applying the filter to the faceted chart lets Vega-Lite know that we can filter out those values up front, prior to the facet subdivision.
405
+ """)
406
+ return
407
+
408
+
409
+ @app.cell
410
+ def _(alt, weather):
411
+ _colors = alt.Scale(domain=['drizzle', 'fog', 'rain', 'snow', 'sun'], range=['#aec7e8', '#c7c7c7', '#1f77b4', '#9467bd', '#e7ba52'])
412
+ alt.Chart().mark_bar().encode(alt.X('temp_max:Q', bin=True, title='Temperature (°C)'), alt.Y('count():Q'), alt.Color('weather:N', scale=_colors)).properties(width=150, height=150).facet(data=weather, column='weather:N').transform_filter('datum.location == "Seattle"')
413
+ return
414
+
415
+
416
+ @app.cell(hide_code=True)
417
+ def _(mo):
418
+ mo.md(r"""
419
+ Given all the extra code above, why would we want to use an explicit `facet` operator? For basic charts, we should certainly use the `column` or `row` encoding channels if we can. However, using the `facet` operator explicitly is useful if we want to facet composed views, such as layered charts.
420
+
421
+ Let's revisit our layered temperature plots from earlier. Instead of plotting data for New York and Seattle in the same plot, let's break them up into separate facets. The individual chart definitions are nearly the same as before: one area chart and one line chart. The only difference is that this time we won't pass the data directly to the chart constructors; we'll wait and pass it to the facet operator later. We can layer the charts much as before, then invoke `facet` on the layered chart object, passing in the data and specifying `column` facets based on the `location` field:
422
+ """)
423
+ return
424
+
425
+
426
+ @app.cell
427
+ def _(alt, weather):
428
+ tempMinMax_5 = alt.Chart().mark_area(opacity=0.3).encode(alt.X('month(date):T', title=None, axis=alt.Axis(format='%b')), alt.Y('average(temp_max):Q', title='Avg. Temperature (°C)'), alt.Y2('average(temp_min):Q'), alt.Color('location:N'))
429
+ tempMid_2 = alt.Chart().mark_line().transform_calculate(temp_mid='(+datum.temp_min + +datum.temp_max) / 2').encode(alt.X('month(date):T'), alt.Y('average(temp_mid):Q'), alt.Color('location:N'))
430
+ alt.layer(tempMinMax_5, tempMid_2).facet(data=weather, column='location:N')
431
+ return
432
+
433
+
434
+ @app.cell(hide_code=True)
435
+ def _(mo):
436
+ mo.md(r"""
437
+ The faceted charts we have seen so far use the same axis scale domains across the facet cells. This default of using *shared* scales and axes aids accurate comparison of values. However, in some cases you may wish to scale each chart independently, for example if the range of values in the cells differs significantly.
438
+
439
+ Similar to layered charts, faceted charts also support _resolving_ to independent scales or axes across plots. Let's see what happens if we call the `resolve_axis` method to request `independent` y-axes:
440
+ """)
441
+ return
442
+
443
+
444
+ @app.cell
445
+ def _(alt, weather):
446
+ tempMinMax_6 = alt.Chart().mark_area(opacity=0.3).encode(alt.X('month(date):T', title=None, axis=alt.Axis(format='%b')), alt.Y('average(temp_max):Q', title='Avg. Temperature (°C)'), alt.Y2('average(temp_min):Q'), alt.Color('location:N'))
447
+ tempMid_3 = alt.Chart().mark_line().transform_calculate(temp_mid='(+datum.temp_min + +datum.temp_max) / 2').encode(alt.X('month(date):T'), alt.Y('average(temp_mid):Q'), alt.Color('location:N'))
448
+ alt.layer(tempMinMax_6, tempMid_3).facet(data=weather, column='location:N').resolve_axis(y='independent')
449
+ return
450
+
451
+
452
+ @app.cell(hide_code=True)
453
+ def _(mo):
454
+ mo.md(r"""
455
+ _The chart above looks largely unchanged, but the plot for Seattle now includes its own axis._
456
+
457
+ What if we instead call `resolve_scale` to resolve the underlying scale domains?
458
+ """)
459
+ return
460
+
461
+
462
+ @app.cell
463
+ def _(alt, weather):
464
+ tempMinMax_7 = alt.Chart().mark_area(opacity=0.3).encode(alt.X('month(date):T', title=None, axis=alt.Axis(format='%b')), alt.Y('average(temp_max):Q', title='Avg. Temperature (°C)'), alt.Y2('average(temp_min):Q'), alt.Color('location:N'))
465
+ tempMid_4 = alt.Chart().mark_line().transform_calculate(temp_mid='(+datum.temp_min + +datum.temp_max) / 2').encode(alt.X('month(date):T'), alt.Y('average(temp_mid):Q'), alt.Color('location:N'))
466
+ alt.layer(tempMinMax_7, tempMid_4).facet(data=weather, column='location:N').resolve_scale(y='independent')
467
+ return
468
+
469
+
470
+ @app.cell(hide_code=True)
471
+ def _(mo):
472
+ mo.md(r"""
473
+ _Now we see facet cells with different axis scale domains. In this case, using independent scales seems like a bad idea! The domains aren't very different, and one might be fooled into thinking that New York and Seattle have similar maximum summer temperatures._
474
+
475
+ To borrow a cliché: just because you *can* do something, doesn't mean you *should*...
476
+ """)
477
+ return
478
+
479
+
480
+ @app.cell(hide_code=True)
481
+ def _(mo):
482
+ mo.md(r"""
483
+ ## Concatenate
484
+ """)
485
+ return
486
+
487
+
488
+ @app.cell(hide_code=True)
489
+ def _(mo):
490
+ mo.md(r"""
491
+ Faceting creates [small multiple](https://en.wikipedia.org/wiki/Small_multiple) plots that show separate subdivisions of the data. However, we might wish to create a multi-view display with different views of the *same* dataset (not subsets) or views involving *different* datasets.
492
+
493
+ Altair provides *concatenation* operators to combine arbitrary charts into a composed chart. The `hconcat` operator (shorthand `|`) performs horizontal concatenation, while the `vconcat` operator (shorthand `&`) performs vertical concatenation.
494
+ """)
495
+ return
496
+
497
+
498
+ @app.cell(hide_code=True)
499
+ def _(mo):
500
+ mo.md(r"""
501
+ Let's start with a basic line chart showing the average maximum temperature per month for both New York and Seattle, much like we've seen before:
502
+ """)
503
+ return
504
+
505
+
506
+ @app.cell
507
+ def _(alt, weather):
508
+ alt.Chart(weather).mark_line().encode(
509
+ alt.X('month(date):T', title=None),
510
+ alt.Y('average(temp_max):Q'),
511
+ color='location:N'
512
+ )
513
+ return
514
+
515
+
516
+ @app.cell(hide_code=True)
517
+ def _(mo):
518
+ mo.md(r"""
519
+ _What if we want to compare not just temperature over time, but also precipitation and wind levels?_
520
+
521
+ Let's create a concatenated chart consisting of three plots. We'll start by defining a "base" chart definition that contains all the aspects shared by our three plots. We can then modify this base chart to create customized variants, with different y-axis encodings for the `temp_max`, `precipitation`, and `wind` fields, and concatenate them using the pipe (`|`) shorthand operator:
522
+ """)
523
+ return
524
+
525
+
526
+ @app.cell
527
+ def _(alt, weather):
528
+ base = alt.Chart(weather).mark_line().encode(alt.X('month(date):T', title=None), color='location:N').properties(width=240, height=180)
529
+ temp = base.encode(alt.Y('average(temp_max):Q'))
530
+ _precip = base.encode(alt.Y('average(precipitation):Q'))
531
+ wind = base.encode(alt.Y('average(wind):Q'))
532
+ temp | _precip | wind
533
+ return
534
+
535
+
536
+ @app.cell(hide_code=True)
537
+ def _(mo):
538
+ mo.md(r"""
539
+ Alternatively, we could use the more explicit `alt.hconcat()` method in lieu of the pipe `|` operator. _Try rewriting the code above to use `hconcat` instead._
540
+
541
+ Vertical concatenation works similarly to horizontal concatenation. _Using the `&` operator (or `alt.vconcat` method), modify the code to use a vertical ordering instead of a horizontal ordering._
542
+
543
+ Finally, note that horizontal and vertical concatenation can be combined. _What happens if you write something like `(temp | precip) & wind`?_
544
+
545
+ _Aside_: Note the importance of those parentheses... what happens if you remove them? Keep in mind that these overloaded operators are still subject to [Python's operator precedence rules](https://docs.python.org/3/reference/expressions.html#operator-precedence), and so vertical concatenation with `&` will take precedence over horizontal concatenation with `|`!
546
+
547
+ As we will revisit later, concatenation operators let you combine any and all charts into a multi-view dashboard!
548
+ """)
549
+ return
550
+
551
+
552
+ @app.cell(hide_code=True)
553
+ def _(mo):
554
+ mo.md(r"""
555
+ ## Repeat
556
+ """)
557
+ return
558
+
559
+
560
+ @app.cell(hide_code=True)
561
+ def _(mo):
562
+ mo.md(r"""
563
+ The concatenation operators above are quite general, allowing arbitrary charts to be composed. Nevertheless, the example above was still a bit verbose: we have three very similar charts, yet have to define them separately and then concatenate them.
564
+
565
+ For cases where only one or two variables are changing, the `repeat` operator provides a convenient shortcut for creating multiple charts. Given a *template* specification with some free variables, the repeat operator will then create a chart for each specified assignment to those variables.
566
+
567
+ Let's recreate our concatenation example above using the `repeat` operator. The only aspect that changes across charts is the choice of data field for the `y` encoding channel. To create a template specification, we can use the *repeater variable* `alt.repeat('column')` as our y-axis field. This code simply states that we want to use the variable assigned to the `column` repeater, which organizes repeated charts in a horizontal direction. (As the repeater provides the field name only, we have to specify the field data type separately as `type='quantitative'`.)
568
+
569
+ We then invoke the `repeat` method, passing in data field names for each column:
570
+ """)
571
+ return
572
+
573
+
574
+ @app.cell
575
+ def _(alt, weather):
576
+ alt.Chart(weather).mark_line().encode(
577
+ alt.X('month(date):T',title=None),
578
+ alt.Y(alt.repeat('column'), aggregate='average', type='quantitative'),
579
+ color='location:N'
580
+ ).properties(
581
+ width=240,
582
+ height=180
583
+ ).repeat(
584
+ column=['temp_max', 'precipitation', 'wind']
585
+ )
586
+ return
587
+
588
+
589
+ @app.cell(hide_code=True)
590
+ def _(mo):
591
+ mo.md(r"""
592
+ Repetition is supported for both columns and rows. _What happens if you modify the code above to use `row` instead of `column`?_
593
+
594
+ We can also use `row` and `column` repetition together! One common visualization for exploratory data analysis is the [scatter plot matrix (or SPLOM)](https://en.wikipedia.org/wiki/Scatter_plot#Scatterplot_matrices). Given a collection of variables to inspect, a SPLOM provides a grid of all pairwise plots of those variables, allowing us to assess potential associations.
595
+
596
+ Let's use the `repeat` operator to create a SPLOM for the `temp_max`, `precipitation`, and `wind` fields. We first create our template specification, with repeater variables for both the x- and y-axis data fields. We then invoke `repeat`, passing in arrays of field names to use for both `row` and `column`. Altair will then generate the [cross product (or, Cartesian product)](https://en.wikipedia.org/wiki/Cartesian_product) to create the full space of repeated charts:
597
+ """)
598
+ return
599
+
600
+
601
+ @app.cell
602
+ def _(alt, weather):
603
+ alt.Chart().mark_point(filled=True, size=15, opacity=0.5).encode(
604
+ alt.X(alt.repeat('column'), type='quantitative'),
605
+ alt.Y(alt.repeat('row'), type='quantitative')
606
+ ).properties(
607
+ width=150,
608
+ height=150
609
+ ).repeat(
610
+ data=weather,
611
+ row=['temp_max', 'precipitation', 'wind'],
612
+ column=['wind', 'precipitation', 'temp_max']
613
+ ).transform_filter(
614
+ 'datum.location == "Seattle"'
615
+ )
616
+ return
617
+
618
+
619
+ @app.cell(hide_code=True)
620
+ def _(mo):
621
+ mo.md(r"""
622
+ _Looking at these plots, there does not appear to be a strong association between precipitation and wind, though we do see that extreme wind and precipitation events occur in similar temperature ranges (~5-15° C). However, this observation is not particularly surprising: if we revisit our histogram at the beginning of the facet section, we can plainly see that the days with maximum temperatures in the range of 5-15° C are the most commonly occurring._
623
+
624
+ *Modify the code above to get a better understanding of chart repetition. Try adding another variable (`temp_min`) to the SPLOM. What happens if you rearrange the order of the field names in either the `row` or `column` parameters for the `repeat` operator?*
625
+
626
+ _Finally, to really appreciate what the `repeat` operator provides, take a moment to imagine how you might recreate the SPLOM above using only `hconcat` and `vconcat`!_
627
+ """)
628
+ return
629
+
630
+
631
+ @app.cell(hide_code=True)
632
+ def _(mo):
633
+ mo.md(r"""
634
+ ## A View Composition Algebra
635
+ """)
636
+ return
637
+
638
+
639
+ @app.cell(hide_code=True)
640
+ def _(mo):
641
+ mo.md(r"""
642
+ Together, the composition operators `layer`, `facet`, `concat`, and `repeat` form a *view composition algebra*: the various operators can be combined to construct a variety of multi-view visualizations.
643
+
644
+ As an example, let's start with two basic charts: a histogram and a simple line (a single `rule` mark) showing a global average.
645
+ """)
646
+ return
647
+
648
+
649
+ @app.cell
650
+ def _(alt, weather):
651
+ basic1 = alt.Chart(weather).transform_filter(
652
+ 'datum.location == "Seattle"'
653
+ ).mark_bar().encode(
654
+ alt.X('month(date):O'),
655
+ alt.Y('average(temp_max):Q')
656
+ )
657
+
658
+ basic2 = alt.Chart(weather).transform_filter(
659
+ 'datum.location == "Seattle"'
660
+ ).mark_rule(stroke='firebrick').encode(
661
+ alt.Y('average(temp_max):Q')
662
+ )
663
+
664
+ basic1 | basic2
665
+ return
666
+
667
+
668
+ @app.cell(hide_code=True)
669
+ def _(mo):
670
+ mo.md(r"""
671
+ We can then combine the two charts using a `layer` operator, and then `repeat` that layered chart to show histograms with overlaid averages for multiple fields:
672
+ """)
673
+ return
674
+
675
+
676
+ @app.cell
677
+ def _(alt, weather):
678
+ alt.layer(
679
+ alt.Chart().mark_bar().encode(
680
+ alt.X('month(date):O', title='Month'),
681
+ alt.Y(alt.repeat('column'), aggregate='average', type='quantitative')
682
+ ),
683
+ alt.Chart().mark_rule(stroke='firebrick').encode(
684
+ alt.Y(alt.repeat('column'), aggregate='average', type='quantitative')
685
+ )
686
+ ).properties(
687
+ width=200,
688
+ height=150
689
+ ).repeat(
690
+ data=weather,
691
+ column=['temp_max', 'precipitation', 'wind']
692
+ ).transform_filter(
693
+ 'datum.location == "Seattle"'
694
+ )
695
+ return
696
+
697
+
698
+ @app.cell(hide_code=True)
699
+ def _(mo):
700
+ mo.md(r"""
701
+ Focusing only on the multi-view composition operators, the model for the visualization above is:
702
+
703
+ ```
704
+ repeat(column=[...])
705
+ |- layer
706
+ |- basic1
707
+ |- basic2
708
+ ```
709
+
710
+ Now let's explore how we can apply *all* the operators within a final [dashboard](https://en.wikipedia.org/wiki/Dashboard_%28business%29) that provides an overview of Seattle weather. We'll combine the SPLOM and faceted histogram displays from earlier sections with the repeated histograms above:
711
+ """)
712
+ return
713
+
714
+
715
+ @app.cell
716
+ def _(alt, weather):
717
+ splom = alt.Chart().mark_point(filled=True, size=15, opacity=0.5).encode(
718
+ alt.X(alt.repeat('column'), type='quantitative'),
719
+ alt.Y(alt.repeat('row'), type='quantitative')
720
+ ).properties(
721
+ width=125,
722
+ height=125
723
+ ).repeat(
724
+ row=['temp_max', 'precipitation', 'wind'],
725
+ column=['wind', 'precipitation', 'temp_max']
726
+ )
727
+
728
+ dateHist = alt.layer(
729
+ alt.Chart().mark_bar().encode(
730
+ alt.X('month(date):O', title='Month'),
731
+ alt.Y(alt.repeat('row'), aggregate='average', type='quantitative')
732
+ ),
733
+ alt.Chart().mark_rule(stroke='firebrick').encode(
734
+ alt.Y(alt.repeat('row'), aggregate='average', type='quantitative')
735
+ )
736
+ ).properties(
737
+ width=175,
738
+ height=125
739
+ ).repeat(
740
+ row=['temp_max', 'precipitation', 'wind']
741
+ )
742
+
743
+ tempHist = alt.Chart(weather).mark_bar().encode(
744
+ alt.X('temp_max:Q', bin=True, title='Temperature (°C)'),
745
+ alt.Y('count():Q'),
746
+ alt.Color('weather:N', scale=alt.Scale(
747
+ domain=['drizzle', 'fog', 'rain', 'snow', 'sun'],
748
+ range=['#aec7e8', '#c7c7c7', '#1f77b4', '#9467bd', '#e7ba52']
749
+ ))
750
+ ).properties(
751
+ width=115,
752
+ height=100
753
+ ).facet(
754
+ column='weather:N'
755
+ )
756
+
757
+ alt.vconcat(
758
+ alt.hconcat(splom, dateHist),
759
+ tempHist,
760
+ data=weather,
761
+ title='Seattle Weather Dashboard'
762
+ ).transform_filter(
763
+ 'datum.location == "Seattle"'
764
+ ).resolve_legend(
765
+ color='independent'
766
+ ).configure_axis(
767
+ labelAngle=0
768
+ )
769
+ return
770
+
771
+
772
+ @app.cell(hide_code=True)
773
+ def _(mo):
774
+ mo.md(r"""
775
+ The full composition model for this dashboard is:
776
+
777
+ ```
778
+ vconcat
779
+ |- hconcat
780
+ | |- repeat(row=[...], column=[...])
781
+ | | |- splom base chart
782
+ | |- repeat(row=[...])
783
+ | |- layer
784
+ | |- dateHist base chart 1
785
+ | |- dateHist base chart 2
786
+ |- facet(column='weather')
787
+ |- tempHist base chart
788
+ ```
789
+
790
+ _Phew!_ The dashboard also includes a few customizations to improve the layout:
791
+
792
+ - We adjust chart `width` and `height` properties to assist alignment and ensure the full visualization fits on the screen.
793
+ - We add `resolve_legend(color='independent')` to ensure the color legend is associated directly with the colored histograms by temperature. Otherwise, the legend will resolve to the dashboard as a whole.
794
+ - We use `configure_axis(labelAngle=0)` to ensure that no axis labels are rotated. This helps to ensure proper alignment among the scatter plots in the SPLOM and the histograms by month on the right.
795
+
796
+ _Try removing or modifying any of these adjustments and see how the dashboard layout responds!_
797
+
798
+ This dashboard can be reused to show data for other locations or from other datasets. _Update the dashboard to show weather patterns for New York instead of Seattle._
799
+ """)
800
+ return
801
+
802
+
803
+ @app.cell(hide_code=True)
804
+ def _(mo):
805
+ mo.md(r"""
806
+ ## Summary
807
+
808
+ For more details on multi-view composition, including control over sub-plot spacing and header labels, see the [Altair Compound Charts documentation](https://altair-viz.github.io/user_guide/compound_charts.html).
809
+
810
+ Now that we've seen how to compose multiple views, we're ready to put them into action. In addition to statically presenting data, multiple views can enable interactive multi-dimensional exploration. For example, using _linked selections_ we can highlight points in one view to see corresponding values highlight in other views.
811
+
812
+ In the next notebook, we'll examine how to author *interactive selections* for both individual plots and multi-view compositions.
813
+ """)
814
+ return
815
+
816
+
817
+ if __name__ == "__main__":
818
+ app.run()
altair/06_interaction.py ADDED
@@ -0,0 +1,671 @@
1
+ # /// script
2
+ # requires-python = ">=3.11"
3
+ # dependencies = [
4
+ # "altair==6.0.0",
5
+ # "marimo",
6
+ # "pandas==3.0.1",
7
+ # ]
8
+ # ///
9
+
10
+ import marimo
11
+
12
+ __generated_with = "0.20.4"
13
+ app = marimo.App()
14
+
15
+
16
+ @app.cell
17
+ def _():
18
+ import marimo as mo
19
+
20
+ return (mo,)
21
+
22
+
23
+ @app.cell(hide_code=True)
24
+ def _(mo):
25
+ mo.md(r"""
26
+ # Interaction
27
+
28
+ _“A graphic is not ‘drawn’ once and for all; it is ‘constructed’ and reconstructed until it reveals all the relationships constituted by the interplay of the data. The best graphic operations are those carried out by the decision-maker themself.”_ &mdash; [Jacques Bertin](https://books.google.com/books?id=csqX_xnm4tcC)
29
+
30
+ Visualization provides a powerful means of making sense of data. A single image, however, typically provides answers to, at best, a handful of questions. Through _interaction_ we can transform static images into tools for exploration: highlighting points of interest, zooming in to reveal finer-grained patterns, and linking across multiple views to reason about multi-dimensional relationships.
31
+
32
+ At the core of interaction is the notion of a _selection_: a means of indicating to the computer which elements or regions we are interested in. For example, we might hover the mouse over a point, click multiple marks, or draw a bounding box around a region to highlight subsets of the data for further scrutiny.
33
+
34
+ Alongside visual encodings and data transformations, Altair provides a _selection_ abstraction for authoring interactions. These selections encompass three aspects:
35
+
36
+ 1. Input event handling to select points or regions of interest, such as mouse hover, click, drag, scroll, and touch events.
37
+ 2. Generalizing from the input to form a selection rule (or [_predicate_](https://en.wikipedia.org/wiki/Predicate_%28mathematical_logic%29)) that determines whether or not a given data record lies within the selection.
38
+ 3. Using the selection predicate to dynamically configure a visualization by driving _conditional encodings_, _filter transforms_, or _scale domains_.
39
+
40
+ This notebook introduces interactive selections and explores how to use them to author a variety of interaction techniques, such as dynamic queries, panning &amp; zooming, details-on-demand, and brushing &amp; linking.
41
+
42
+ _This notebook is part of the [data visualization curriculum](https://github.com/uwdata/visualization-curriculum)._
43
+ """)
44
+ return
45
+
46
+
47
+ @app.cell
48
+ def _():
49
+ import pandas as pd
50
+ import altair as alt
51
+
52
+ return alt, pd
53
+
54
+
55
+ @app.cell(hide_code=True)
56
+ def _(mo):
57
+ mo.md(r"""
58
+ ## Datasets
59
+ """)
60
+ return
61
+
62
+
63
+ @app.cell(hide_code=True)
64
+ def _(mo):
65
+ mo.md(r"""
66
+ We will visualize a variety of datasets from the [vega-datasets](https://github.com/vega/vega-datasets) collection:
67
+
68
+ - A dataset of `cars` from the 1970s and early 1980s,
69
+ - A dataset of `movies`, previously used in the [Data Transformation](https://github.com/uwdata/visualization-curriculum/blob/master/altair_data_transformation.ipynb) notebook,
70
+ - A dataset containing ten years of [S&amp;P 500](https://en.wikipedia.org/wiki/S%26P_500_Index) (`sp500`) stock prices,
71
+ - A dataset of technology company `stocks`, and
72
+ - A dataset of `flights`, including departure time, distance, and arrival delay.
73
+ """)
74
+ return
75
+
76
+
77
+ @app.cell
78
+ def _():
79
+ cars = 'https://cdn.jsdelivr.net/npm/vega-datasets@1/data/cars.json'
80
+ movies = 'https://cdn.jsdelivr.net/npm/vega-datasets@1/data/movies.json'
81
+ sp500 = 'https://cdn.jsdelivr.net/npm/vega-datasets@1/data/sp500.csv'
82
+ stocks = 'https://cdn.jsdelivr.net/npm/vega-datasets@1/data/stocks.csv'
83
+ flights = 'https://cdn.jsdelivr.net/npm/vega-datasets@1/data/flights-5k.json'
84
+ return cars, flights, movies, sp500, stocks
85
+
86
+
87
+ @app.cell(hide_code=True)
88
+ def _(mo):
89
+ mo.md(r"""
90
+ ## Introducing Selections
91
+ """)
92
+ return
93
+
94
+
95
+ @app.cell(hide_code=True)
96
+ def _(mo):
97
+ mo.md(r"""
98
+ Let's start with a basic selection: simply clicking a point to highlight it. Using the `cars` dataset, we'll start with a scatter plot of horsepower versus miles per gallon, with a color encoding for the number of cylinders in the car engine.
99
+
100
+ In addition, we'll create a selection instance by calling `alt.selection_point(toggle=False)`, indicating we want a selection defined over a _single value_. By default, the selection uses a mouse click to determine the selected value. To register a selection with a chart, we must add it using the `.add_params()` method.
101
+
102
+ Once our selection has been defined, we can use it as a parameter for _conditional encodings_, which apply a different encoding depending on whether a data record lies in or out of the selection. For example, consider the following code:
103
+
104
+ ~~~ python
105
+ color=alt.condition(selection, 'Cylinders:O', alt.value('grey'))
106
+ ~~~
107
+
108
+ This encoding definition states that data points contained within the `selection` should be colored according to the `Cylinders` field, while non-selected data points should use a default `grey`. An empty selection includes _all_ data points, and so initially all points will be colored.
109
+
110
+ _Try clicking different points in the chart below. What happens? (Click the background to clear the selection state and return to an "empty" selection.)_
111
+ """)
112
+ return
113
+
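The per-record logic of this conditional encoding can be sketched in plain Python. This is a conceptual model only, not Altair code; the `conditional_color` helper and the `id` field are hypothetical:

```python
# Conceptual model of alt.condition(selection, 'Cylinders:O', alt.value('grey')):
# records inside the selection get the field-based encoding, all others get a
# constant value. An *empty* selection matches every record.
def conditional_color(record, selected_ids):
    """Hypothetical helper mirroring the conditional-encoding rule."""
    if not selected_ids or record["id"] in selected_ids:
        return f"color-for-{record['Cylinders']}"  # encode by field value
    return "grey"                                  # constant fallback

records = [{"id": 1, "Cylinders": 4}, {"id": 2, "Cylinders": 8}]
print([conditional_color(r, set()) for r in records])  # empty selection: all colored
print([conditional_color(r, {1}) for r in records])    # only record 1 keeps its color
```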
114
+
115
+ @app.cell
116
+ def _(alt, cars):
117
+ _selection = alt.selection_point(toggle=False)
+ alt.Chart(cars).mark_circle().add_params(_selection).encode(
+     x='Horsepower:Q',
+     y='Miles_per_Gallon:Q',
+     color=alt.condition(_selection, 'Cylinders:O', alt.value('grey')),
+     opacity=alt.condition(_selection, alt.value(0.8), alt.value(0.1))
+ )
119
+ return
120
+
121
+
122
+ @app.cell(hide_code=True)
123
+ def _(mo):
124
+ mo.md(r"""
125
+ Of course, highlighting individual data points one-at-a-time is not particularly exciting! As we'll see, however, single value selections provide a useful building block for more powerful interactions. Moreover, single value selections are just one of the three selection types provided by Altair:
126
+
127
+ - `selection_point(toggle=False)` - select a single discrete value, by default on click events.
+ - `selection_point()` - select multiple discrete values. The first value is selected on mouse click and additional values toggled using shift-click.
+ - `selection_interval()` - select a continuous range of values, initiated by mouse drag.
130
+
131
+ Let's compare each of these selection types side-by-side. To keep our code tidy we'll first define a function (`plot`) that generates a scatter plot specification just like the one above. We can pass a selection to the `plot` function to have it applied to the chart:
132
+ """)
133
+ return
134
+
135
+
136
+ @app.cell
137
+ def _(alt, cars):
138
+ def plot(selection):
+     return alt.Chart(cars).mark_circle().add_params(selection).encode(
+         x='Horsepower:Q',
+         y='Miles_per_Gallon:Q',
+         color=alt.condition(selection, 'Cylinders:O', alt.value('grey')),
+         opacity=alt.condition(selection, alt.value(0.8), alt.value(0.1))
+     ).properties(width=240, height=180)
140
+
141
+ return (plot,)
142
+
143
+
144
+ @app.cell(hide_code=True)
145
+ def _(mo):
146
+ mo.md(r"""
147
+ Let's use our `plot` function to create three chart variants, one per selection type.
148
+
149
+ The first (`single`) chart replicates our earlier example. The second (`multi`) chart supports shift-click interactions to toggle inclusion of multiple points within the selection. The third (`interval`) chart generates a selection region (or _brush_) upon mouse drag. Once created, you can drag the brush around to select different points, or scroll when the cursor is inside the brush to scale (zoom) the brush size.
150
+
151
+ _Try interacting with each of the charts below!_
152
+ """)
153
+ return
154
+
155
+
156
+ @app.cell
157
+ def _(alt, plot):
158
+ alt.hconcat(
159
+ plot(alt.selection_point(toggle=False)).properties(title='Single (Click)'),
160
+ plot(alt.selection_point()).properties(title='Multi (Shift-Click)'),
161
+ plot(alt.selection_interval()).properties(title='Interval (Drag)')
162
+ )
163
+ return
164
+
165
+
166
+ @app.cell(hide_code=True)
167
+ def _(mo):
168
+ mo.md(r"""
169
+ The examples above use default interactions (click, shift-click, drag) for each selection type. We can further customize the interactions by providing input event specifications using [Vega event selector syntax](https://vega.github.io/vega/docs/event-streams/). For example, we can modify our `single` and `multi` charts to trigger upon `mouseover` events instead of `click` events.
170
+
171
+ _Hold down the shift key in the second chart to "paint" with data!_
172
+ """)
173
+ return
174
+
175
+
176
+ @app.cell
177
+ def _(alt, plot):
178
+ alt.hconcat(
179
+ plot(alt.selection_point(toggle=False, on='mouseover')).properties(title='Single (Mouseover)'),
180
+ plot(alt.selection_point(on='mouseover')).properties(title='Multi (Shift-Mouseover)')
181
+ )
182
+ return
183
+
184
+
185
+ @app.cell(hide_code=True)
186
+ def _(mo):
187
+ mo.md(r"""
188
+ Now that we've covered the basics of Altair selections, let's take a tour through the various interaction techniques they enable!
189
+ """)
190
+ return
191
+
192
+
193
+ @app.cell(hide_code=True)
194
+ def _(mo):
195
+ mo.md(r"""
196
+ ## Dynamic Queries
197
+ """)
198
+ return
199
+
200
+
201
+ @app.cell(hide_code=True)
202
+ def _(mo):
203
+ mo.md(r"""
204
+ _Dynamic queries_ enable rapid, reversible exploration of data to isolate patterns of interest. As defined by [Ahlberg, Williamson, &amp; Shneiderman](https://www.cs.umd.edu/~ben/papers/Ahlberg1992Dynamic.pdf), a dynamic query:
205
+
206
+ - represents a query graphically,
207
+ - provides visible limits on the query range,
208
+ - provides a graphical representation of the data and query result,
209
+ - gives immediate feedback of the result after every query adjustment,
210
+ - and allows novice users to begin working with little training.
211
+
212
+ A common approach is to manipulate query parameters using standard user interface widgets such as sliders, radio buttons, and drop-down menus. To generate dynamic query widgets, we can apply a selection's `bind` operation to one or more data fields we wish to query.
213
+
214
+ Let's build an interactive scatter plot that uses a dynamic query to filter the display. Given a scatter plot of movie ratings (from Rotten Tomatoes and IMDB), we can add a selection over the `Major_Genre` field to enable interactive filtering by film genre.
215
+ """)
216
+ return
217
+
218
+
219
+ @app.cell(hide_code=True)
220
+ def _(mo):
221
+ mo.md(r"""
222
+ To start, let's extract the unique (non-null) genres from the `movies` data:
223
+ """)
224
+ return
225
+
226
+
227
+ @app.cell
228
+ def _(movies, pd):
229
+ df = pd.read_json(movies) # load movies data
230
+ genres = df['Major_Genre'].unique() # get unique field values
231
+ genres = list(filter(pd.notna, genres)) # filter out None/NaN values
232
+ genres.sort() # sort alphabetically
233
+ return (genres,)
234
+
235
+
236
+ @app.cell(hide_code=True)
237
+ def _(mo):
238
+ mo.md(r"""
239
+ For later use, let's also define a list of unique `MPAA_Rating` values:
240
+ """)
241
+ return
242
+
243
+
244
+ @app.cell
245
+ def _():
246
+ mpaa = ['G', 'PG', 'PG-13', 'R', 'NC-17', 'Not Rated']
247
+ return (mpaa,)
248
+
249
+
250
+ @app.cell(hide_code=True)
251
+ def _(mo):
252
+ mo.md(r"""
253
+ Now let's create a single-value point selection bound to a drop-down menu.
254
+
255
+ *Use the dynamic query menu below to explore the data. How do ratings vary by genre? How would you revise the code to filter `MPAA_Rating` (G, PG, PG-13, etc.) instead of `Major_Genre`?*
256
+ """)
257
+ return
258
+
259
+
260
+ @app.cell
261
+ def _(alt, genres, movies):
262
+ selectGenre = alt.selection_point(
263
+ toggle=False,
264
+ name='Select', # name the selection 'Select'
265
+ fields=['Major_Genre'], # limit selection to the Major_Genre field
266
+ value=[{'Major_Genre': genres[0]}], # use first genre entry as initial value
267
+ bind=alt.binding_select(options=genres) # bind to a menu of unique genre values
268
+ )
269
+
270
+ alt.Chart(movies).mark_circle().add_params(
271
+ selectGenre
272
+ ).encode(
273
+ x='Rotten_Tomatoes_Rating:Q',
274
+ y='IMDB_Rating:Q',
275
+ tooltip='Title:N',
276
+ opacity=alt.condition(selectGenre, alt.value(0.75), alt.value(0.05))
277
+ )
278
+ return
279
+
280
+
281
+ @app.cell(hide_code=True)
282
+ def _(mo):
283
+ mo.md(r"""
284
+ Our construction above leverages multiple aspects of selections:
285
+
286
+ - We give the selection a name (`'Select'`). This name is not required, but allows us to influence the label text of the generated dynamic query menu. (_What happens if you remove the name? Try it!_)
287
+ - We constrain the selection to a specific data field (`Major_Genre`). Earlier when we used a `single` selection, the selection mapped to individual data points. By limiting the selection to a specific field, we can select _all_ data points whose `Major_Genre` field value matches the single selected value.
288
+ - We initialize the selection to a starting value using the `value=...` argument.
289
+ - We `bind` the selection to an interface widget, in this case a drop-down menu via `binding_select`.
290
+ - As before, we then use a conditional encoding to control the opacity channel.
291
+ """)
292
+ return
293
+
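The field-constrained matching described above can be sketched in plain Python. This is a conceptual model only; `select_by_field` and the toy records are hypothetical, not part of the Altair API:

```python
# Constraining a point selection to a field means selecting one value
# selects *every* record whose field matches it, not a single data point.
movies = [
    {"Title": "A", "Major_Genre": "Drama"},
    {"Title": "B", "Major_Genre": "Comedy"},
    {"Title": "C", "Major_Genre": "Drama"},
]

def select_by_field(records, field, value):
    """Hypothetical model of fields=['Major_Genre'] matching."""
    return [r["Title"] for r in records if r[field] == value]

print(select_by_field(movies, "Major_Genre", "Drama"))  # ['A', 'C']
```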
294
+
295
+ @app.cell(hide_code=True)
296
+ def _(mo):
297
+ mo.md(r"""
298
+ ### Binding Selections to Multiple Inputs
299
+
300
+ One selection instance can be bound to _multiple_ dynamic query widgets. Let's modify the example above to provide filters for _both_ `Major_Genre` and `MPAA_Rating`, using radio buttons instead of a menu. Our single-value selection is now defined over a single _pair_ of genre and MPAA rating values.
301
+
302
+ _Look for surprising conjunctions of genre and rating. Are there any G or PG-rated horror films?_
303
+ """)
304
+ return
305
+
306
+
307
+ @app.cell
308
+ def _(alt, genres, movies, mpaa):
309
+ # single-value selection over [Major_Genre, MPAA_Rating] pairs
+ # use specific hard-wired values as the initial selected values
+ _selection = alt.selection_point(
+     toggle=False,
+     name='Select',
+     fields=['Major_Genre', 'MPAA_Rating'],
+     value=[{'Major_Genre': 'Drama', 'MPAA_Rating': 'R'}],
+     bind={
+         'Major_Genre': alt.binding_select(options=genres),
+         'MPAA_Rating': alt.binding_radio(options=mpaa)
+     }
+ )
+ # scatter plot, modify opacity based on selection
+ alt.Chart(movies).mark_circle().add_params(_selection).encode(
+     x='Rotten_Tomatoes_Rating:Q',
+     y='IMDB_Rating:Q',
+     tooltip='Title:N',
+     opacity=alt.condition(_selection, alt.value(0.75), alt.value(0.05))
+ )
314
+ return
315
+
316
+
317
+ @app.cell(hide_code=True)
318
+ def _(mo):
319
+ mo.md(r"""
320
+ _Fun facts: The PG-13 rating didn't exist when the movies [Jaws](https://www.imdb.com/title/tt0073195/) and [Jaws 2](https://www.imdb.com/title/tt0077766/) were released. The first film to receive a PG-13 rating was 1984's [Red Dawn](https://www.imdb.com/title/tt0087985/)._
321
+ """)
322
+ return
323
+
324
+
325
+ @app.cell(hide_code=True)
326
+ def _(mo):
327
+ mo.md(r"""
328
+ ### Using Visualizations as Dynamic Queries
329
+
330
+ Though standard interface widgets show the _possible_ query parameter values, they do not visualize the _distribution_ of those values. We might also wish to use richer interactions, such as multi-value or interval selections, rather than input widgets that select only a single value at a time.
331
+
332
+ To address these issues, we can author additional charts to both visualize data and support dynamic queries. Let's add a histogram of the count of films per year and use an interval selection to dynamically highlight films over selected time periods.
333
+
334
+ *Interact with the year histogram to explore films from different time periods. Do you see any evidence of [sampling bias](https://en.wikipedia.org/wiki/Sampling_bias) across the years? (How do year and critics' ratings relate?)*
335
+
336
+ _The years range from 1930 to 2040! Are future films in pre-production, or are there "off-by-one century" errors? Also, depending on which time zone you're in, you may see a small bump in either 1969 or 1970. Why might that be? (See the end of the notebook for an explanation!)_
337
+ """)
338
+ return
339
+
340
+
341
+ @app.cell
342
+ def _(alt, movies):
343
+ # dynamic query histogram, with the selection limited to x-axis (year) values
+ _brush = alt.selection_interval(encodings=['x'])
+ years = alt.Chart(movies).mark_bar().add_params(_brush).encode(
+     alt.X('year(Release_Date):T', title='Films by Release Year'),
+     alt.Y('count():Q', title=None)
+ ).properties(width=650, height=50)
+ # scatter plot, modify opacity based on selection
+ ratings = alt.Chart(movies).mark_circle().encode(
+     x='Rotten_Tomatoes_Rating:Q',
+     y='IMDB_Rating:Q',
+     tooltip='Title:N',
+     opacity=alt.condition(_brush, alt.value(0.75), alt.value(0.05))
+ ).properties(width=650, height=400)
+ alt.vconcat(years, ratings).properties(spacing=5)
349
+ return
350
+
351
+
352
+ @app.cell(hide_code=True)
353
+ def _(mo):
354
+ mo.md(r"""
355
+ The example above provides dynamic queries using a _linked selection_ between charts:
356
+
357
+ - We create an `interval` selection (`brush`), and set `encodings=['x']` to limit the selection to the x-axis only, resulting in a one-dimensional selection interval.
358
+ - We register `brush` with our histogram of films per year via `.add_params(brush)`.
359
+ - We use `brush` in a conditional encoding to adjust the scatter plot `opacity`.
360
+
361
+ This interaction technique of selecting elements in one chart and seeing linked highlights in one or more other charts is known as [_brushing &amp; linking_](https://en.wikipedia.org/wiki/Brushing_and_linking).
362
+ """)
363
+ return
364
+
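The one-dimensional interval selection used here amounts to a range predicate on the x encoding. A plain-Python sketch (the `brushed` helper is hypothetical, not Altair API):

```python
# An x-only interval brush is a closed range [lo, hi] on one encoding.
def brushed(year, interval):
    """Hypothetical model of an x-only interval brush; None = empty selection."""
    if interval is None:          # empty selection includes all points
        return True
    lo, hi = interval
    return lo <= year <= hi

print(brushed(1985, (1980, 1990)))  # True: inside the brush
print(brushed(1955, (1980, 1990)))  # False: outside the brush
print(brushed(1955, None))          # True: empty brush selects everything
```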
365
+
366
+ @app.cell(hide_code=True)
367
+ def _(mo):
368
+ mo.md(r"""
369
+ ## Panning &amp; Zooming
370
+ """)
371
+ return
372
+
373
+
374
+ @app.cell(hide_code=True)
375
+ def _(mo):
376
+ mo.md(r"""
377
+ The movie rating scatter plot is a bit cluttered in places, making it hard to examine points in denser regions. Using the interaction techniques of _panning_ and _zooming_, we can inspect dense regions more closely.
378
+
379
+ Let's start by thinking about how we might express panning and zooming using Altair selections. What defines the "viewport" of a chart? _Axis scale domains!_
380
+
381
+ We can change the scale domains to modify the visualized range of data values. To do so interactively, we can bind an `interval` selection to scale domains with the code `bind='scales'`. The result is that instead of an interval brush that we can drag and zoom, we can drag and zoom the entire plotting area!
382
+
383
+ _In the chart below, click and drag to pan (translate) the view, or scroll to zoom (scale) the view. What can you discover about the precision of the provided rating values?_
384
+ """)
385
+ return
386
+
387
+
388
+ @app.cell
389
+ def _(alt, movies):
390
+ alt.Chart(movies).mark_circle().add_params(
391
+ alt.selection_interval(bind='scales')
392
+ ).encode(
393
+ x='Rotten_Tomatoes_Rating:Q',
394
+ y=alt.Y('IMDB_Rating:Q', axis=alt.Axis(minExtent=30)), # use min extent to stabilize axis title placement
395
+ tooltip=['Title:N', 'Release_Date:N', 'IMDB_Rating:Q', 'Rotten_Tomatoes_Rating:Q']
396
+ ).properties(
397
+ width=600,
398
+ height=400
399
+ )
400
+ return
401
+
402
+
403
+ @app.cell(hide_code=True)
404
+ def _(mo):
405
+ mo.md(r"""
406
+ _Zooming in, we can see that the rating values have limited precision! The Rotten Tomatoes ratings are integers, while the IMDB ratings are truncated to tenths. As a result, there is overplotting even when we zoom, with multiple movies sharing the same rating values._
407
+
408
+ Reading the code above, you may notice the code `alt.Axis(minExtent=30)` in the `y` encoding channel. The `minExtent` parameter ensures a minimum amount of space is reserved for axis ticks and labels. Why do this? When we pan and zoom, the axis labels may change and cause the axis title position to shift. By setting a minimum extent we can reduce distracting movements in the plot. _Try changing the `minExtent` value, for example setting it to zero, and then zoom out to see what happens when longer axis labels enter the view._
409
+
410
+ Altair also includes a shorthand for adding panning and zooming to a plot. Instead of directly creating a selection, you can call `.interactive()` to have Altair automatically generate an interval selection bound to the chart's scales:
411
+ """)
412
+ return
413
+
414
+
415
+ @app.cell
416
+ def _(alt, movies):
417
+ alt.Chart(movies).mark_circle().encode(
418
+ x='Rotten_Tomatoes_Rating:Q',
419
+ y=alt.Y('IMDB_Rating:Q', axis=alt.Axis(minExtent=30)), # use min extent to stabilize axis title placement
420
+ tooltip=['Title:N', 'Release_Date:N', 'IMDB_Rating:Q', 'Rotten_Tomatoes_Rating:Q']
421
+ ).properties(
422
+ width=600,
423
+ height=400
424
+ ).interactive()
425
+ return
426
+
427
+
428
+ @app.cell(hide_code=True)
429
+ def _(mo):
430
+ mo.md(r"""
431
+ By default, scale bindings for selections include both the `x` and `y` encoding channels. What if we want to limit panning and zooming along a single dimension? We can invoke `encodings=['x']` to constrain the selection to the `x` channel only:
432
+ """)
433
+ return
434
+
435
+
436
+ @app.cell
437
+ def _(alt, movies):
438
+ alt.Chart(movies).mark_circle().add_params(
439
+ alt.selection_interval(bind='scales', encodings=['x'])
440
+ ).encode(
441
+ x='Rotten_Tomatoes_Rating:Q',
442
+ y=alt.Y('IMDB_Rating:Q', axis=alt.Axis(minExtent=30)), # use min extent to stabilize axis title placement
443
+ tooltip=['Title:N', 'Release_Date:N', 'IMDB_Rating:Q', 'Rotten_Tomatoes_Rating:Q']
444
+ ).properties(
445
+ width=600,
446
+ height=400
447
+ )
448
+ return
449
+
450
+
451
+ @app.cell(hide_code=True)
452
+ def _(mo):
453
+ mo.md(r"""
454
+ _When zooming along a single axis only, the shape of the visualized data can change, potentially affecting our perception of relationships in the data. [Choosing an appropriate aspect ratio](http://vis.stanford.edu/papers/arclength-banking) is an important visualization design concern!_
455
+ """)
456
+ return
457
+
458
+
459
+ @app.cell(hide_code=True)
460
+ def _(mo):
461
+ mo.md(r"""
462
+ ## Navigation: Overview + Detail
463
+ """)
464
+ return
465
+
466
+
467
+ @app.cell(hide_code=True)
468
+ def _(mo):
469
+ mo.md(r"""
470
+ When panning and zooming, we directly adjust the "viewport" of a chart. The related navigation strategy of _overview + detail_ instead uses an overview display to show _all_ of the data, while supporting selections that pan and zoom a separate focus display.
471
+
472
+ Below we have two area charts showing a decade of price fluctuations for the S&amp;P 500 stock index. Initially both charts show the same data range. _Click and drag in the bottom overview chart to update the focus display and examine specific time spans._
473
+ """)
474
+ return
475
+
476
+
477
+ @app.cell
478
+ def _(alt, sp500):
479
+ _brush = alt.selection_interval(encodings=['x'])
+ _base = alt.Chart().mark_area().encode(
+     alt.X('date:T', title=None),
+     alt.Y('price:Q')
+ ).properties(width=700)
+ alt.vconcat(
+     _base.encode(alt.X('date:T', title=None, scale=alt.Scale(domain=_brush))),  # focus view
+     _base.add_params(_brush).properties(height=60),  # overview with brush
+     data=sp500
+ )
482
+ return
483
+
484
+
485
+ @app.cell(hide_code=True)
486
+ def _(mo):
487
+ mo.md(r"""
488
+ Unlike our earlier panning &amp; zooming case, here we don't want to bind a selection directly to the scales of a single interactive chart. Instead, we want to bind the selection to a scale domain in _another_ chart. To do so, we update the `x` encoding channel for our focus chart, setting the scale `domain` property to reference our `brush` selection. If no interval is defined (the selection is empty), Altair ignores the brush and uses the underlying data to determine the domain. When a brush interval is created, Altair instead uses that as the scale `domain` for the focus chart.
489
+ """)
490
+ return
491
+
492
+
493
+ @app.cell(hide_code=True)
494
+ def _(mo):
495
+ mo.md(r"""
496
+ ## Details on Demand
497
+ """)
498
+ return
499
+
500
+
501
+ @app.cell(hide_code=True)
502
+ def _(mo):
503
+ mo.md(r"""
504
+ Once we spot points of interest within a visualization, we often want to know more about them. _Details-on-demand_ refers to interactively querying for more information about selected values. _Tooltips_ are one useful means of providing details on demand. However, tooltips typically only show information for one data point at a time. How might we show more?
505
+
506
+ The movie ratings scatterplot includes a number of potentially interesting outliers where the Rotten Tomatoes and IMDB ratings disagree. Let's create a plot that allows us to interactively select points and show their labels. To trigger the filter query on either the hover or click interaction, we will use the [Altair composition operator](https://altair-viz.github.io/user_guide/interactions.html#composing-multiple-selections) `|` ("or").
507
+
508
+ _Mouse over points in the scatter plot below to see a highlight and title label. Shift-click points to make annotations persistent and view multiple labels at once. Which movies are loved by Rotten Tomatoes critics, but not the general audience on IMDB (or vice versa)? See if you can find possible errors, where two different movies with the same name were accidentally combined!_
509
+ """)
510
+ return
511
+
512
+
513
+ @app.cell
514
+ def _(alt, movies):
515
+ hover = alt.selection_point(toggle=False, on='mouseover', nearest=True, empty=False)
+ click = alt.selection_point(empty=False)
+ plot_1 = alt.Chart().mark_circle().encode(x='Rotten_Tomatoes_Rating:Q', y='IMDB_Rating:Q')
+ # annotation layers are shown only for points in either selection
+ _base = plot_1.transform_filter(hover | click)
+ alt.layer(
+     plot_1.add_params(hover).add_params(click),
+     _base.mark_point(size=100, stroke='firebrick', strokeWidth=1),  # circular annotation
+     _base.mark_text(dx=4, dy=-8, align='right', stroke='white', strokeWidth=2).encode(text='Title:N'),  # legible background
+     _base.mark_text(dx=4, dy=-8, align='right').encode(text='Title:N'),  # title label
+     data=movies
+ ).properties(width=600, height=450)
520
+ return
521
+
522
+
523
+ @app.cell(hide_code=True)
524
+ def _(mo):
525
+ mo.md(r"""
526
+ The example above adds three new layers to the scatter plot: a circular annotation, white text to provide a legible background, and black text showing a film title. In addition, this example uses two selections in tandem:
527
+
528
+ 1. A single selection (`hover`) that includes `nearest=True` to automatically select the nearest data point as the mouse moves.
529
+ 2. A multi selection (`click`) to create persistent selections via shift-click.
530
+
531
+ Both selections set `empty=False` to indicate that no points should be included if a selection is empty. These selections are then combined into a single filter predicate &mdash; the logical _or_ of `hover` and `click` &mdash; to include points that reside in _either_ selection. We use this predicate to filter the new layers to show annotations and labels for selected points only.
532
+ """)
533
+ return
534
+
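The filtering logic of `hover | click` with `empty=False` can be modeled in plain Python (a conceptual sketch; `passes_filter` and the integer record ids are hypothetical):

```python
# With empty=False an empty selection matches nothing, so a record passes the
# or-combined filter only if at least one selection actually contains it.
def passes_filter(record_id, hover, click):
    """Hypothetical predicate for transform_filter(hover | click)."""
    return record_id in hover or record_id in click

hover, click = set(), {3, 7}           # hover empty, two shift-clicked points
print(passes_filter(3, hover, click))  # True: in the click selection
print(passes_filter(5, hover, click))  # False: in neither selection
```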
535
+
536
+ @app.cell(hide_code=True)
537
+ def _(mo):
538
+ mo.md(r"""
539
+ Using selections and layers, we can realize a number of different designs for details on demand! For example, here is a log-scaled time series of technology stock prices, annotated with a guideline and labels for the date nearest the mouse cursor:
540
+ """)
541
+ return
542
+
543
+
544
+ @app.cell
545
+ def _(alt, stocks):
546
+ # select a point for which to provide details-on-demand
+ label = alt.selection_point(
+     toggle=False,
+     encodings=['x'],  # limit selection to x-axis value
+     on='mouseover',   # select on mouseover events
+     nearest=True,     # select data point nearest the cursor
+     empty=False       # empty selection includes no data points
+ )
+ # define our base line chart of stock prices
+ _base = alt.Chart().mark_line().encode(
+     alt.X('date:T'),
+     alt.Y('price:Q', scale=alt.Scale(type='log')),
+     alt.Color('symbol:N')
+ )
+ alt.layer(
+     _base,  # base line chart
+     # add a rule mark to serve as a guide line
+     alt.Chart().mark_rule(color='#aaa').encode(x='date:T').transform_filter(label),
+     # add circle marks for selected time points, hide unselected points
+     _base.mark_circle().encode(
+         opacity=alt.condition(label, alt.value(1), alt.value(0))
+     ).add_params(label),
+     # add white stroked text to provide a legible background for labels
+     _base.mark_text(align='left', dx=5, dy=-5, stroke='white', strokeWidth=2).encode(text='price:Q').transform_filter(label),
+     # add text labels for stock prices
+     _base.mark_text(align='left', dx=5, dy=-5).encode(text='price:Q').transform_filter(label),
+     data=stocks
+ ).properties(width=700, height=400)
551
+ return
552
+
553
+
554
+ @app.cell(hide_code=True)
555
+ def _(mo):
556
+ mo.md(r"""
557
+ _Putting into action what we've learned so far: can you modify the movie scatter plot above (the one with the dynamic query over years) to include a `rule` mark that shows the average IMDB (or Rotten Tomatoes) rating for the data contained within the year `interval` selection?_
558
+ """)
559
+ return
560
+
561
+
562
+ @app.cell(hide_code=True)
563
+ def _(mo):
564
+ mo.md(r"""
565
+ ## Brushing &amp; Linking, Revisited
566
+ """)
567
+ return
568
+
569
+
570
+ @app.cell(hide_code=True)
571
+ def _(mo):
572
+ mo.md(r"""
573
+ Earlier in this notebook we saw an example of _brushing &amp; linking_: using a dynamic query histogram to highlight points in a movie rating scatter plot. Here, we'll visit some additional examples involving linked selections.
574
+
575
+ Returning to the `cars` dataset, we can use the `repeat` operator to build a [scatter plot matrix (SPLOM)](https://en.wikipedia.org/wiki/Scatter_plot#Scatterplot_matrices) that shows associations between mileage, acceleration, and horsepower. We can define an `interval` selection and include it _within_ our repeated scatter plot specification to enable linked selections among all the plots.
576
+
577
+ _Click and drag in any of the plots below to perform brushing &amp; linking!_
578
+ """)
579
+ return
580
+
581
+
582
+ @app.cell
583
+ def _(alt, cars):
584
+ # resolve all selections to a single global instance
+ _brush = alt.selection_interval(resolve='global')
+ alt.Chart(cars).mark_circle().add_params(_brush).encode(
+     alt.X(alt.repeat('column'), type='quantitative'),
+     alt.Y(alt.repeat('row'), type='quantitative'),
+     color=alt.condition(_brush, 'Cylinders:O', alt.value('grey')),
+     opacity=alt.condition(_brush, alt.value(0.8), alt.value(0.1))
+ ).properties(width=140, height=140).repeat(
+     column=['Acceleration', 'Horsepower', 'Miles_per_Gallon'],
+     row=['Miles_per_Gallon', 'Horsepower', 'Acceleration']
+ )
586
+ return
587
+
588
+
589
+ @app.cell(hide_code=True)
590
+ def _(mo):
591
+ mo.md(r"""
592
+ Note above the use of `resolve='global'` on the `interval` selection. The default setting of `'global'` indicates that across all plots only one brush can be active at a time. However, in some cases we might want to define brushes in multiple plots and combine the results. If we use `resolve='union'`, the selection will be the _union_ of all brushes: if a point resides within any brush it will be selected. Alternatively, if we use `resolve='intersect'`, the selection will consist of the _intersection_ of all brushes: only points that reside within all brushes will be selected.
593
+
594
+ _Try setting the `resolve` parameter to `'union'` and `'intersect'` and see how it changes the resulting selection logic._
595
+ """)
596
+ return
597
+
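A plain-Python sketch of how brushes combine under the different `resolve` settings (the `in_brush` and `selected` helpers are hypothetical, not Altair API):

```python
# Each brush is a 2-D interval; union = any brush matches, intersect = all match.
def in_brush(pt, brush):
    (x0, x1), (y0, y1) = brush
    return x0 <= pt[0] <= x1 and y0 <= pt[1] <= y1

def selected(pt, brushes, resolve):
    """Hypothetical combiner for resolve='union' vs resolve='intersect'."""
    hits = [in_brush(pt, b) for b in brushes]
    return any(hits) if resolve == "union" else all(hits)

brushes = [((0, 5), (0, 5)), ((3, 8), (3, 8))]  # two overlapping brushes
print(selected((4, 4), brushes, "intersect"))   # True: inside both brushes
print(selected((1, 1), brushes, "union"))       # True: inside the first brush
print(selected((1, 1), brushes, "intersect"))   # False: outside the second brush
```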
598
+
599
+ @app.cell(hide_code=True)
600
+ def _(mo):
601
+ mo.md(r"""
602
+ ### Cross-Filtering
603
+
604
+ The brushing &amp; linking examples we've looked at all use conditional encodings, for example to change opacity values in response to a selection. Another option is to use a selection defined in one view to _filter_ the content of another view.
605
+
606
+ Let's build a collection of histograms for the `flights` dataset: arrival `delay` (how early or late a flight arrives, in minutes), `distance` flown (in miles), and `time` of departure (hour of the day). We'll use the `repeat` operator to create the histograms, and add an `interval` selection for the `x` axis with brushes resolved via intersection.
607
+
608
+ In particular, each histogram will consist of two layers: a gray background layer and a blue foreground layer, with the foreground layer filtered by our intersection of brush selections. The result is a _cross-filtering_ interaction across the three charts!
609
+
610
+ _Drag out brush intervals in the charts below. As you select flights with longer or shorter arrival delays, how do the distance and time distributions respond?_
611
+ """)
612
+ return
613
+
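The two derived fields in the chart's `transform_calculate` step can be mirrored in plain Python (hypothetical helper names; the original Vega expressions are quoted in the docstrings):

```python
from datetime import datetime

def clamp_delay(delay_minutes):
    """datum.delay < 180 ? datum.delay : 180 -- cap delays at 3 hours."""
    return delay_minutes if delay_minutes < 180 else 180

def fractional_hour(dt):
    """hours(datum.date) + minutes(datum.date) / 60 -- departure time of day."""
    return dt.hour + dt.minute / 60

print(clamp_delay(45))                                # 45
print(clamp_delay(300))                               # 180
print(fractional_hour(datetime(2001, 1, 1, 17, 30)))  # 17.5
```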
614
+
615
+ @app.cell
616
+ def _(alt, flights):
617
+ _brush = alt.selection_interval(encodings=['x'], resolve='intersect')
618
+ hist = alt.Chart().mark_bar().encode(alt.X(alt.repeat('row'), type='quantitative', bin=alt.Bin(maxbins=100, minstep=1), axis=alt.Axis(format='d', titleAnchor='start')), alt.Y('count():Q', title=None))
619
+ alt.layer(hist.add_params(_brush).encode(color=alt.value('lightgrey')), hist.transform_filter(_brush)).properties(width=900, height=100).repeat(row=['delay', 'distance', 'time'], data=flights).transform_calculate(delay='datum.delay < 180 ? datum.delay : 180', time='hours(datum.date) + minutes(datum.date) / 60').configure_view(stroke='transparent') # up to 100 bins # integer format, left-aligned title # no y-axis title # clamp delays > 3 hours # fractional hours # no outline
620
+ return
621
+
622
+
623
+ @app.cell(hide_code=True)
624
+ def _(mo):
625
+ mo.md(r"""
626
+ _By cross-filtering you can observe that delayed flights are more likely to depart at later hours. This phenomenon is familiar to frequent fliers: a delay can propagate through the day, affecting subsequent travel by that plane. For the best odds of an on-time arrival, book an early flight!_
627
+
628
+ The combination of multiple views and interactive selections can enable valuable forms of multi-dimensional reasoning, turning even basic histograms into powerful input devices for asking questions of a dataset!
629
+ """)
630
+ return
631
+
632
+
633
+ @app.cell(hide_code=True)
634
+ def _(mo):
635
+ mo.md(r"""
636
+ ## Summary
637
+ """)
638
+ return
639
+
640
+
641
+ @app.cell(hide_code=True)
642
+ def _(mo):
643
+ mo.md(r"""
644
+ For more information about the supported interaction options in Altair, please consult the [Altair interactive selection documentation](https://altair-viz.github.io/user_guide/interactions.html). For details about customizing event handlers, for example to compose multiple interaction techniques or support touch-based input on mobile devices, see the [Vega-Lite selection documentation](https://vega.github.io/vega-lite/docs/selection.html).
645
+
646
+ Interested in learning more?
647
+ - The _selection_ abstraction was introduced in the paper [Vega-Lite: A Grammar of Interactive Graphics](http://idl.cs.washington.edu/papers/vega-lite/), by Satyanarayan, Moritz, Wongsuphasawat, &amp; Heer.
648
+ - The PRIM-9 system (for projection, rotation, isolation, and masking in up to 9 dimensions) is one of the earliest interactive visualization tools, built in the early 1970s by Fisherkeller, Tukey, &amp; Friedman. [A retro demo video survives!](https://www.youtube.com/watch?v=B7XoW2qiFUA)
649
+ - The concept of brushing &amp; linking was crystallized by Becker, Cleveland, &amp; Wilks in their 1987 article [Dynamic Graphics for Data Analysis](https://scholar.google.com/scholar?cluster=14817303117298653693).
650
+ - For a comprehensive summary of interaction techniques for visualization, see [Interactive Dynamics for Visual Analysis](https://queue.acm.org/detail.cfm?id=2146416) by Heer &amp; Shneiderman.
651
+ - Finally, for a treatise on what makes interaction effective, read the classic [Direct Manipulation Interfaces](https://scholar.google.com/scholar?cluster=15702972136892195211) paper by Hutchins, Hollan, &amp; Norman.
652
+ """)
653
+ return
654
+
655
+
656
+ @app.cell(hide_code=True)
657
+ def _(mo):
658
+ mo.md(r"""
659
+ #### Appendix: On The Representation of Time
660
+
661
+ Earlier we observed a small bump in the number of movies in either 1969 or 1970. Where does that bump come from? And why 1969 _or_ 1970? The answer stems from a combination of missing data and how your computer represents time.
662
+
663
+ Internally, dates and times are represented relative to the [UNIX epoch](https://en.wikipedia.org/wiki/Unix_time), in which time "zero" corresponds to the stroke of midnight on January 1, 1970 in [UTC time](https://en.wikipedia.org/wiki/Coordinated_Universal_Time), which runs along the [prime meridian](https://en.wikipedia.org/wiki/Prime_meridian). It turns out there are a few movies with missing (`null`) release dates. Those `null` values get interpreted as time `0`, and thus map to January 1, 1970 in UTC time. If you live in the Americas &ndash; and thus in "earlier" time zones &ndash; this precise point in time corresponds to an earlier hour on December 31, 1969 in your local time zone. On the other hand, if you live near or east of the prime meridian, the date in your local time zone will be January 1, 1970.
664
+
665
+ The takeaway? Always be skeptical of your data, and be mindful that how data is represented (whether as date times, or floating point numbers, or latitudes and longitudes, _etc._) can sometimes lead to artifacts that impact analysis!
666
+ """)
667
+ return
668
+
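The timezone effect described above is easy to reproduce with Python's standard library. This is a minimal sketch, not the notebook's own code; the UTC-5 offset stands in for a US Eastern time zone:

```python
from datetime import datetime, timezone, timedelta

# A missing release date stored as time 0 is the UNIX epoch in UTC.
epoch_utc = datetime.fromtimestamp(0, tz=timezone.utc)
print(epoch_utc.date())  # 1970-01-01

# Viewed from a US time zone (UTC-5), the very same instant
# falls on the previous calendar day.
eastern = timezone(timedelta(hours=-5))
print(epoch_utc.astimezone(eastern).date())  # 1969-12-31
```

The instant in time never changes; only its calendar rendering does, which is exactly why the bump lands on 1969 or 1970 depending on where the chart is viewed.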
669
+
670
+ if __name__ == "__main__":
671
+ app.run()
altair/07_cartographic.py ADDED
@@ -0,0 +1,898 @@
1
+ # /// script
2
+ # requires-python = ">=3.11"
3
+ # dependencies = [
4
+ # "altair==6.0.0",
5
+ # "marimo",
6
+ # "pandas==3.0.1",
7
+ # "vega_datasets==0.9.0",
8
+ # ]
9
+ # ///
10
+
11
+ import marimo
12
+
13
+ __generated_with = "0.20.4"
14
+ app = marimo.App()
15
+
16
+
17
+ @app.cell
18
+ def _():
19
+ import marimo as mo
20
+
21
+ return (mo,)
22
+
23
+
24
+ @app.cell(hide_code=True)
25
+ def _(mo):
26
+ mo.md(r"""
27
+ # Cartographic Visualization
28
+
29
+ _“The making of maps is one of humanity's longest established intellectual endeavors and also one of its most complex, with scientific theory, graphical representation, geographical facts, and practical considerations blended together in an unending variety of ways.”_ &mdash; [H. J. Steward](https://books.google.com/books?id=cVy1Ms43fFYC)
30
+
31
+ Cartography &ndash; the study and practice of map-making &ndash; has a rich history spanning centuries of discovery and design. Cartographic visualization leverages mapping techniques to convey data containing spatial information, such as locations, routes, or trajectories on the surface of the Earth.
32
+
33
+ <div style="float: right; margin-left: 1em; margin-top: 1em;"><img width="300px" src="https://gist.githubusercontent.com/jheer/c90d582ef5322582cf4960ec7689f6f6/raw/8dc92382a837ccc34c076f4ce7dd864e7893324a/latlon.png" /></div>
34
+
35
+ Approximating the Earth as a sphere, we can denote positions using a spherical coordinate system of _latitude_ (angle in degrees north or south of the _equator_) and _longitude_ (angle in degrees specifying east-west position). In this system, a _parallel_ is a circle of constant latitude and a _meridian_ is a circle of constant longitude. The [_prime meridian_](https://en.wikipedia.org/wiki/Prime_meridian) lies at 0° longitude and by convention is defined to pass through the Royal Observatory in Greenwich, England.
36
+
37
+ To "flatten" a three-dimensional sphere on to a two-dimensional plane, we must apply a [projection](https://en.wikipedia.org/wiki/Map_projection) that maps (`longitude`, `latitude`) pairs to (`x`, `y`) coordinates. Similar to [scales](https://github.com/uwdata/visualization-curriculum/blob/master/altair_scales_axes_legends.ipynb), projections map from a data domain (spatial position) to a visual range (pixel position). However, the scale mappings we've seen thus far accept a one-dimensional domain, whereas map projections are inherently two-dimensional.
38
+
39
+ In this notebook, we will introduce the basics of creating maps and visualizing spatial data with Altair, including:
40
+
41
+ - Data formats for representing geographic features,
42
+ - Geo-visualization techniques such as point, symbol, and choropleth maps, and
43
+ - A review of common cartographic projections.
44
+
45
+ _This notebook is part of the [data visualization curriculum](https://github.com/uwdata/visualization-curriculum)._
46
+ """)
47
+ return
48
+
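To make the idea of a projection concrete, here is a minimal sketch of the spherical Mercator formulas (not Altair's implementation, which delegates to Vega), mapping degrees of longitude and latitude to planar coordinates:

```python
import math

def mercator(lon_deg, lat_deg):
    """Project (longitude, latitude) in degrees onto the plane."""
    lon = math.radians(lon_deg)
    lat = math.radians(lat_deg)
    # Longitude maps linearly; latitude is stretched toward the poles.
    return lon, math.log(math.tan(math.pi / 4 + lat / 2))

x0, y0 = mercator(0, 0)   # prime meridian meets the equator, near the origin
y60 = mercator(0, 60)[1]
y30 = mercator(0, 30)[1]
print(y60 / y30)  # more than double: area inflates at high latitudes
```

The nonlinearity in `y` is why Greenland looks enormous on Mercator maps, and why choosing a projection is a design decision rather than a formality.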
49
+
50
+ @app.cell
51
+ def _():
52
+ import pandas as pd
53
+ import altair as alt
54
+ from vega_datasets import data
55
+
56
+ return alt, data
57
+
58
+
59
+ @app.cell(hide_code=True)
60
+ def _(mo):
61
+ mo.md(r"""
62
+ ## Geographic Data: GeoJSON and TopoJSON
63
+ """)
64
+ return
65
+
66
+
67
+ @app.cell(hide_code=True)
68
+ def _(mo):
69
+ mo.md(r"""
70
+ Up to this point, we have worked with JSON and CSV formatted datasets that correspond to data tables made up of rows (records) and columns (fields). In order to represent geographic regions (countries, states, _etc._) and trajectories (flight paths, subway lines, _etc._), we need to expand our repertoire with additional formats designed to support rich geometries.
71
+
72
+ [GeoJSON](https://en.wikipedia.org/wiki/GeoJSON) models geographic features within a specialized JSON format. A GeoJSON `feature` can include geometric data &ndash; such as `longitude`, `latitude` coordinates that make up a country boundary &ndash; as well as additional data attributes.
73
+
74
+ Here is a GeoJSON `feature` object for the boundary of the U.S. state of Colorado:
75
+ """)
76
+ return
77
+
78
+
79
+ @app.cell(hide_code=True)
80
+ def _(mo):
81
+ mo.md(r"""
82
+ ~~~ json
83
+ {
84
+ "type": "Feature",
85
+ "id": 8,
86
+ "properties": {"name": "Colorado"},
87
+ "geometry": {
88
+ "type": "Polygon",
89
+ "coordinates": [
90
+ [[-106.32056285448942,40.998675790862656],[-106.19134826714341,40.99813863734313],[-105.27607827344248,40.99813863734313],[-104.9422739227986,40.99813863734313],[-104.05212898774828,41.00136155846029],[-103.57475287338661,41.00189871197981],[-103.38093099236758,41.00189871197981],[-102.65589358559272,41.00189871197981],[-102.62000064466328,41.00189871197981],[-102.052892177978,41.00189871197981],[-102.052892177978,40.74889940428302],[-102.052892177978,40.69733266640851],[-102.052892177978,40.44003613055551],[-102.052892177978,40.3492571857556],[-102.052892177978,40.00333031918079],[-102.04930288388505,39.57414465707943],[-102.04930288388505,39.56823596836465],[-102.0457135897921,39.1331416175485],[-102.0457135897921,39.0466599009048],[-102.0457135897921,38.69751011321283],[-102.0457135897921,38.61478847120581],[-102.0457135897921,38.268861604631],[-102.0457135897921,38.262415762396685],[-102.04212429569915,37.738153927339205],[-102.04212429569915,37.64415206142214],[-102.04212429569915,37.38900413964724],[-102.04212429569915,36.99365914927603],[-103.00046581851544,37.00010499151034],[-103.08660887674611,37.00010499151034],[-104.00905745863294,36.99580776335414],[-105.15404227428235,36.995270609834606],[-105.2222388620483,36.995270609834606],[-105.7175614468747,36.99580776335414],[-106.00829426840322,36.995270609834606],[-106.47490250048605,36.99365914927603],[-107.4224761410235,37.00010499151034],[-107.48349414060355,37.00010499151034],[-108.38081766383978,36.99903068447129],[-109.04483707103458,36.99903068447129],[-109.04483707103458,37.484617466122884],[-109.04124777694163,37.88049961001363],[-109.04124777694163,38.15283644441336],[-109.05919424740635,38.49983761802722],[-109.05201565922046,39.36680339854235],[-109.05201565922046,39.49786885730673],[-109.05201565922046,39.66062637372313],[-109.05201565922046,40.22248895514744],[-109.05201565922046,40.653823231326896],[-109.05201565922046,41.000287251421234],[-107.91779872584989,41.00189871197981],[-107.3183866123281,41.00297301901887],[-106.85895696843116,41.00189871197981],[-106.32056285448942,40.998675790862656]]
91
+ ]
92
+ }
93
+ }
94
+ ~~~
95
+ """)
96
+ return
97
+
98
+
99
+ @app.cell(hide_code=True)
100
+ def _(mo):
101
+ mo.md(r"""
102
+ The `feature` includes a `properties` object, which can include any number of data fields, plus a `geometry` object, which in this case contains a single polygon that consists of `[longitude, latitude]` coordinates for the state boundary. The coordinates continue off to the right for a while should you care to scroll...
103
+
104
+ To learn more about the nitty-gritty details of GeoJSON, see the [official GeoJSON specification](http://geojson.org/) or read [Tom MacWright's helpful primer](https://macwright.org/2015/03/23/geojson-second-bite).
105
+ """)
106
+ return
107
+
108
+
109
+ @app.cell(hide_code=True)
110
+ def _(mo):
111
+ mo.md(r"""
112
+ One drawback of GeoJSON as a storage format is that it can be redundant, resulting in larger file sizes. Consider: Colorado shares boundaries with six other states (seven if you include the corner touching Arizona). Instead of using separate, overlapping coordinate lists for each of those states, a more compact approach is to encode shared borders only once, representing the _topology_ of geographic regions. Fortunately, this is precisely what the [TopoJSON](https://github.com/topojson/topojson/blob/master/README.md) format does!
113
+ """)
114
+ return
115
+
116
+
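A toy decoder makes the compactness concrete. In TopoJSON, shared borders are stored once as quantized, delta-encoded integer arcs plus a transform; readers expand them back into coordinates. This sketch is illustrative only (the real format is handled by the `topojson` library, and the arc and transform values here are made up):

```python
def decode_arc(arc, scale, translate):
    """Expand one delta-encoded TopoJSON arc into absolute coordinates."""
    x = y = 0
    coords = []
    for dx, dy in arc:
        # Each point is stored as a small integer delta from the previous one.
        x += dx
        y += dy
        coords.append((x * scale[0] + translate[0],
                       y * scale[1] + translate[1]))
    return coords

# Hypothetical arc and transform, shaped like a TopoJSON "transform" object.
arc = [[4000, 3000], [10, 0], [0, 10]]
print(decode_arc(arc, scale=[0.01, 0.01], translate=[-110.0, 35.0]))
```

Because consecutive boundary points are close together, the deltas stay small, and a border shared by two regions is referenced by index from both rather than duplicated.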
117
+ @app.cell(hide_code=True)
118
+ def _(mo):
119
+ mo.md(r"""
120
+ Let's load a TopoJSON file of world countries (at 110 meter resolution):
121
+ """)
122
+ return
123
+
124
+
125
+ @app.cell
126
+ def _(data):
127
+ world = data.world_110m.url
128
+ world
129
+ return (world,)
130
+
131
+
132
+ @app.cell
133
+ def _(data):
134
+ world_topo = data.world_110m()
135
+ return (world_topo,)
136
+
137
+
138
+ @app.cell
139
+ def _(world_topo):
140
+ world_topo.keys()
141
+ return
142
+
143
+
144
+ @app.cell
145
+ def _(world_topo):
146
+ world_topo['type']
147
+ return
148
+
149
+
150
+ @app.cell
151
+ def _(world_topo):
152
+ world_topo['objects'].keys()
153
+ return
154
+
155
+
156
+ @app.cell(hide_code=True)
157
+ def _(mo):
158
+ mo.md(r"""
159
+ _Inspect the `world_topo` TopoJSON dictionary object above to see its contents._
160
+
161
+ In the data above, the `objects` property indicates the named elements we can extract from the data: geometries for all `countries`, or a single polygon representing all `land` on Earth. Either of these can be unpacked to GeoJSON data we can then visualize.
162
+
163
+ As TopoJSON is a specialized format, we need to instruct Altair to parse it, indicating which named feature object we wish to extract from the topology. The following code indicates that we want to extract GeoJSON features from the `world` dataset for the `countries` object:
164
+
165
+ ~~~ js
166
+ alt.topo_feature(world, 'countries')
167
+ ~~~
168
+
169
+ This `alt.topo_feature` method call expands to the following Vega-Lite JSON:
170
+
171
+ ~~~ json
172
+ {
173
+ "values": world,
174
+ "format": {"type": "topojson", "feature": "countries"}
175
+ }
176
+ ~~~
177
+
178
+ Now that we can load geographic data, we're ready to start making maps!
179
+ """)
180
+ return
181
+
182
+
183
+ @app.cell(hide_code=True)
184
+ def _(mo):
185
+ mo.md(r"""
186
+ ## Geoshape Marks
187
+ """)
188
+ return
189
+
190
+
191
+ @app.cell(hide_code=True)
192
+ def _(mo):
193
+ mo.md(r"""
194
+ To visualize geographic data, Altair provides the `geoshape` mark type. To create a basic map, we can create a `geoshape` mark and pass it our TopoJSON data, which is then unpacked into GeoJSON features, one for each country of the world:
195
+ """)
196
+ return
197
+
198
+
199
+ @app.cell
200
+ def _(alt, world):
201
+ alt.Chart(alt.topo_feature(world, 'countries')).mark_geoshape()
202
+ return
203
+
204
+
205
+ @app.cell(hide_code=True)
206
+ def _(mo):
207
+ mo.md(r"""
208
+ In the example above, Altair applies a default blue color and uses a default map projection (`mercator`). We can customize the colors and boundary stroke widths using standard mark properties. Using the `project` method we can also add our own map projection:
209
+ """)
210
+ return
211
+
212
+
213
+ @app.cell
214
+ def _(alt, world):
215
+ alt.Chart(alt.topo_feature(world, 'countries')).mark_geoshape(
216
+ fill='#2a1d0c', stroke='#706545', strokeWidth=0.5
217
+ ).project(
218
+ type='mercator'
219
+ )
220
+ return
221
+
222
+
223
+ @app.cell(hide_code=True)
224
+ def _(mo):
225
+ mo.md(r"""
226
+ By default Altair automatically adjusts the projection so that all the data fits within the width and height of the chart. We can also specify projection parameters, such as `scale` (zoom level) and `translate` (panning), to customize the projection settings. Here we adjust the `scale` and `translate` parameters to focus on Europe:
227
+ """)
228
+ return
229
+
230
+
231
+ @app.cell
232
+ def _(alt, world):
233
+ alt.Chart(alt.topo_feature(world, 'countries')).mark_geoshape(
234
+ fill='#2a1d0c', stroke='#706545', strokeWidth=0.5
235
+ ).project(
236
+ type='mercator', scale=400, translate=[100, 550]
237
+ )
238
+ return
239
+
240
+
241
+ @app.cell(hide_code=True)
242
+ def _(mo):
243
+ mo.md(r"""
244
+ _Note how the 110m resolution of the data becomes apparent at this scale. To see more detailed coast lines and boundaries, we need an input file with more fine-grained geometries. Adjust the `scale` and `translate` parameters to focus the map on other regions!_
245
+ """)
246
+ return
247
+
248
+
249
+ @app.cell(hide_code=True)
250
+ def _(mo):
251
+ mo.md(r"""
252
+ So far our map shows countries only. Using the `layer` operator, we can combine multiple map elements. Altair includes _data generators_ we can use to create data for additional map layers:
253
+
254
+ - The sphere generator (`{'sphere': True}`) provides a GeoJSON representation of the full sphere of the Earth. We can create an additional `geoshape` mark that fills in the shape of the Earth as a background layer.
255
+ - The graticule generator (`{'graticule': ...}`) creates a GeoJSON feature representing a _graticule_: a grid formed by lines of latitude and longitude. The default graticule has meridians and parallels every 10° between ±80° latitude. For the polar regions, there are meridians every 90°. These settings can be customized using the `stepMinor` and `stepMajor` properties.
256
+
257
+ Let's layer sphere, graticule, and country marks into a reusable map specification:
258
+ """)
259
+ return
260
+
261
+
262
+ @app.cell
263
+ def _(alt, world):
264
+ map = alt.layer(
265
+ # use the sphere of the Earth as the base layer
266
+ alt.Chart({'sphere': True}).mark_geoshape(
267
+ fill='#e6f3ff'
268
+ ),
269
+ # add a graticule for geographic reference lines
270
+ alt.Chart({'graticule': True}).mark_geoshape(
271
+ stroke='#ffffff', strokeWidth=1
272
+ ),
273
+ # and then the countries of the world
274
+ alt.Chart(alt.topo_feature(world, 'countries')).mark_geoshape(
275
+ fill='#2a1d0c', stroke='#706545', strokeWidth=0.5
276
+ )
277
+ ).properties(
278
+ width=600,
279
+ height=400
280
+ )
281
+ return (map,)
282
+
283
+
284
+ @app.cell(hide_code=True)
285
+ def _(mo):
286
+ mo.md(r"""
287
+ We can extend the map with a desired projection and draw the result. Here we apply a [Natural Earth projection](https://en.wikipedia.org/wiki/Natural_Earth_projection). The _sphere_ layer provides the light blue background; the _graticule_ layer provides the white geographic reference lines.
288
+ """)
289
+ return
290
+
291
+
292
+ @app.cell
293
+ def _(map):
294
+ map.project(
295
+ type='naturalEarth1', scale=110, translate=[300, 200]
296
+ ).configure_view(stroke=None)
297
+ return
298
+
299
+
300
+ @app.cell(hide_code=True)
301
+ def _(mo):
302
+ mo.md(r"""
303
+ ## Point Maps
304
+
305
+ In addition to the _geometric_ data provided by GeoJSON or TopoJSON files, many tabular datasets include geographic information in the form of fields for `longitude` and `latitude` coordinates, or references to geographic regions such as country names, state names, postal codes, _etc._, which can be mapped to coordinates using a [geocoding service](https://en.wikipedia.org/wiki/Geocoding). In some cases, location data is rich enough that we can see meaningful patterns by projecting the data points alone &mdash; no base map required!
306
+
307
+ Let's look at a dataset of 5-digit zip codes in the United States, including `longitude`, `latitude` coordinates for each post office in addition to a `zip_code` field.
308
+ """)
309
+ return
310
+
311
+
312
+ @app.cell
313
+ def _(data):
314
+ zipcodes = data.zipcodes.url
315
+ zipcodes
316
+ return (zipcodes,)
317
+
318
+
319
+ @app.cell(hide_code=True)
320
+ def _(mo):
321
+ mo.md(r"""
322
+ We can visualize each post office location using a small (1-pixel) `square` mark. However, to set the positions we do _not_ use `x` and `y` channels. _Why is that?_
323
+
324
+ While cartographic projections map (`longitude`, `latitude`) coordinates to (`x`, `y`) coordinates, they can do so in arbitrary ways. There is no guarantee, for example, that `longitude` → `x` and `latitude` → `y`! Instead, Altair includes special `longitude` and `latitude` encoding channels to handle geographic coordinates. These channels indicate which data fields should be mapped to `longitude` and `latitude` coordinates, and then applies a projection to map those coordinates to (`x`, `y`) positions.
325
+ """)
326
+ return
327
+
328
+
329
+ @app.cell
330
+ def _(alt, zipcodes):
331
+ alt.Chart(zipcodes).mark_square(
332
+ size=1, opacity=1
333
+ ).encode(
334
+ longitude='longitude:Q', # apply the field named 'longitude' to the longitude channel
335
+ latitude='latitude:Q' # apply the field named 'latitude' to the latitude channel
336
+ ).project(
337
+ type='albersUsa'
338
+ ).properties(
339
+ width=900,
340
+ height=500
341
+ ).configure_view(
342
+ stroke=None
343
+ )
344
+ return
345
+
346
+
347
+ @app.cell(hide_code=True)
348
+ def _(mo):
349
+ mo.md(r"""
350
+ _Plotting zip codes only, we can see the outline of the United States and discern meaningful patterns in the density of post offices, without a base map or additional reference elements!_
351
+
352
+ We use the `albersUsa` projection, which takes some liberties with the actual geometry of the Earth, with scaled versions of Alaska and Hawaii in the bottom-left corner. As we did not specify projection `scale` or `translate` parameters, Altair sets them automatically to fit the visualized data.
353
+
354
+ We can now go on to ask more questions of our dataset. For example, is there any rhyme or reason to the allocation of zip codes? To assess this question we can add a color encoding based on the first digit of the zip code. We first add a `calculate` transform to extract the first digit, and encode the result using the color channel:
355
+ """)
356
+ return
357
+
358
+
359
+ @app.cell
360
+ def _(alt, zipcodes):
361
+ alt.Chart(zipcodes).transform_calculate(
362
+ digit='datum.zip_code[0]'
363
+ ).mark_square(
364
+ size=2, opacity=1
365
+ ).encode(
366
+ longitude='longitude:Q',
367
+ latitude='latitude:Q',
368
+ color='digit:N'
369
+ ).project(
370
+ type='albersUsa'
371
+ ).properties(
372
+ width=900,
373
+ height=500
374
+ ).configure_view(
375
+ stroke=None
376
+ )
377
+ return
378
+
379
+
380
+ @app.cell(hide_code=True)
381
+ def _(mo):
382
+ mo.md(r"""
383
+ _To zoom in on a specific digit, add a filter transform to limit the data shown! Try adding an [interactive selection](https://github.com/uwdata/visualization-curriculum/blob/master/altair_interaction.ipynb) to filter to a single digit and dynamically update the map. And be sure to use strings (`'1'`) instead of numbers (`1`) when filtering digit values!_
384
+
385
+ (This example is inspired by Ben Fry's classic [zipdecode](https://benfry.com/zipdecode/) visualization!)
386
+
387
+ We might further wonder what the _sequence_ of zip codes might indicate. One way to explore this question is to connect each consecutive zip code using a `line` mark, as done in Robert Kosara's [ZipScribble](https://eagereyes.org/zipscribble-maps/united-states) visualization:
388
+ """)
389
+ return
390
+
391
+
392
+ @app.cell
393
+ def _(alt, zipcodes):
394
+ alt.Chart(zipcodes).transform_filter(
395
+ '-150 < datum.longitude && 22 < datum.latitude && datum.latitude < 55'
396
+ ).transform_calculate(
397
+ digit='datum.zip_code[0]'
398
+ ).mark_line(
399
+ strokeWidth=0.5
400
+ ).encode(
401
+ longitude='longitude:Q',
402
+ latitude='latitude:Q',
403
+ color='digit:N',
404
+ order='zip_code:O'
405
+ ).project(
406
+ type='albersUsa'
407
+ ).properties(
408
+ width=900,
409
+ height=500
410
+ ).configure_view(
411
+ stroke=None
412
+ )
413
+ return
414
+
415
+
416
+ @app.cell(hide_code=True)
417
+ def _(mo):
418
+ mo.md(r"""
419
+ _We can now see how zip codes further cluster into smaller areas, indicating a hierarchical allocation of codes by location, but with some notable variability within local clusters._
420
+
421
+ If you were paying careful attention to our earlier maps, you may have noticed that there are zip codes being plotted in the upper-left corner! These correspond to locations such as Puerto Rico or American Samoa, which contain U.S. zip codes but are mapped to `null` coordinates (`0`, `0`) by the `albersUsa` projection. In addition, Alaska and Hawaii can complicate our view of the connecting line segments. In response, the code above includes an additional filter that removes points outside our chosen `longitude` and `latitude` spans.
422
+
423
+ _Remove the filter above to see what happens!_
424
+ """)
425
+ return
426
+
427
+
428
+ @app.cell(hide_code=True)
429
+ def _(mo):
430
+ mo.md(r"""
431
+ ## Symbol Maps
432
+ """)
433
+ return
434
+
435
+
436
+ @app.cell(hide_code=True)
437
+ def _(mo):
438
+ mo.md(r"""
439
+ Now let's combine a base map and plotted data as separate layers. We'll examine the U.S. commercial flight network, considering both airports and flight routes. To do so, we'll need three datasets.
440
+ For our base map, we'll use a TopoJSON file for the United States at 10m resolution, containing features for `states` or `counties`:
441
+ """)
442
+ return
443
+
444
+
445
+ @app.cell
446
+ def _(data):
447
+ usa = data.us_10m.url
448
+ usa
449
+ return (usa,)
450
+
451
+
452
+ @app.cell(hide_code=True)
453
+ def _(mo):
454
+ mo.md(r"""
455
+ For the airports, we will use a dataset with fields for the `longitude` and `latitude` coordinates of each airport as well as the `iata` airport code &mdash; for example, `'SEA'` for [Seattle-Tacoma International Airport](https://en.wikipedia.org/wiki/Seattle%E2%80%93Tacoma_International_Airport).
456
+ """)
457
+ return
458
+
459
+
460
+ @app.cell
461
+ def _(data):
462
+ airports = data.airports.url
463
+ airports
464
+ return (airports,)
465
+
466
+
467
+ @app.cell(hide_code=True)
468
+ def _(mo):
469
+ mo.md(r"""
470
+ Finally, we will use a dataset of flight routes, which contains `origin` and `destination` fields with the IATA codes for the corresponding airports:
471
+ """)
472
+ return
473
+
474
+
475
+ @app.cell
476
+ def _(data):
477
+ flights = data.flights_airport.url
478
+ flights
479
+ return (flights,)
480
+
481
+
482
+ @app.cell(hide_code=True)
483
+ def _(mo):
484
+ mo.md(r"""
485
+ Let's start by creating a base map using the `albersUsa` projection, and add a layer that plots `circle` marks for each airport:
486
+ """)
487
+ return
488
+
489
+
490
+ @app.cell
491
+ def _(airports, alt, usa):
492
+ alt.layer(
493
+ alt.Chart(alt.topo_feature(usa, 'states')).mark_geoshape(
494
+ fill='#ddd', stroke='#fff', strokeWidth=1
495
+ ),
496
+ alt.Chart(airports).mark_circle(size=9).encode(
497
+ latitude='latitude:Q',
498
+ longitude='longitude:Q',
499
+ tooltip='iata:N'
500
+ )
501
+ ).project(
502
+ type='albersUsa'
503
+ ).properties(
504
+ width=900,
505
+ height=500
506
+ ).configure_view(
507
+ stroke=None
508
+ )
509
+ return
510
+
511
+
512
+ @app.cell(hide_code=True)
513
+ def _(mo):
514
+ mo.md(r"""
515
+ _That's a lot of airports! Obviously, not all of them are major hubs._
516
+
517
+ Similar to our zip codes dataset, our airport data includes points that lie outside the continental United States. So we again see points in the upper-left corner. We might want to filter these points, but to do so we first need to know more about them.
518
+
519
+ _Update the map projection above to `albers` &ndash; side-stepping the idiosyncratic behavior of `albersUsa` &ndash; so that the actual locations of these additional points are revealed!_
520
+
521
+ Now, instead of showing all airports in an undifferentiated fashion, let's identify major hubs by considering the total number of routes that originate at each airport. We'll use the `flights` dataset as our primary data source: it contains a list of flight routes that we can aggregate to count the number of routes for each `origin` airport.
522
+
523
+ However, the `flights` dataset does not include the _locations_ of the airports! To augment the `flights` data with locations, we need a new data transformation: `lookup`. The `lookup` transform takes a field value in a primary dataset and uses it as a _key_ to look up related information in another table. In this case, we want to match the `origin` airport code in our `flights` dataset against the `iata` field of the `airports` dataset, then extract the corresponding `latitude` and `longitude` fields.
524
+ """)
525
+ return
526
+
527
+
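In relational terms, `lookup` behaves like a left join. Here is a minimal pandas sketch of what Vega-Lite does internally; the route rows and coordinates are made up for illustration:

```python
import pandas as pd

# Hypothetical primary table of routes, and a lookup table of airports.
routes = pd.DataFrame({'origin': ['SEA', 'SFO', 'SEA']})
airports = pd.DataFrame({
    'iata': ['SEA', 'SFO'],
    'latitude': [47.45, 37.62],
    'longitude': [-122.31, -122.38],
})

# Aggregate: count routes per origin (the transform_aggregate step).
counts = routes.groupby('origin').size().reset_index(name='routes')

# Lookup: match 'origin' against the 'iata' key (the transform_lookup step).
merged = counts.merge(airports, left_on='origin', right_on='iata', how='left')
print(merged[['origin', 'routes', 'latitude', 'longitude']])
```

An `origin` code with no matching `iata` entry would simply receive missing coordinates, just as unmatched rows get `null` fields in Vega-Lite.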
528
+ @app.cell
529
+ def _(airports, alt, flights, usa):
530
+ alt.layer(
531
+ alt.Chart(alt.topo_feature(usa, 'states')).mark_geoshape(
532
+ fill='#ddd', stroke='#fff', strokeWidth=1
533
+ ),
534
+ alt.Chart(flights).mark_circle().transform_aggregate(
535
+ groupby=['origin'],
536
+ routes='count()'
537
+ ).transform_lookup(
538
+ lookup='origin',
539
+ from_=alt.LookupData(data=airports, key='iata',
540
+ fields=['state', 'latitude', 'longitude'])
541
+ ).transform_filter(
542
+ 'datum.state !== "PR" && datum.state !== "VI"'
543
+ ).encode(
544
+ latitude='latitude:Q',
545
+ longitude='longitude:Q',
546
+ tooltip=['origin:N', 'routes:Q'],
547
+ size=alt.Size('routes:Q', scale=alt.Scale(range=[0, 1000]), legend=None),
548
+ order=alt.Order('routes:Q', sort='descending')
549
+ )
550
+ ).project(
551
+ type='albersUsa'
552
+ ).properties(
553
+ width=900,
554
+ height=500
555
+ ).configure_view(
556
+ stroke=None
557
+ )
558
+ return
559
+
560
+
561
+ @app.cell(hide_code=True)
562
+ def _(mo):
563
+ mo.md(r"""
564
+ _Which U.S. airports have the highest number of outgoing routes?_
565
+
566
+ Now that we can see the airports, we may wish to interact with them to better understand the structure of the air traffic network. We can add a `rule` mark layer to represent paths from `origin` airports to `destination` airports, which requires two `lookup` transforms to retrieve coordinates for each end point. In addition, we can use a point selection (`alt.selection_point`) to filter these routes, such that only the routes originating at the currently selected airport are shown.
567
+
568
+ _Starting from the static map above, can you build an interactive version? Feel free to skip the code below to engage with the interactive map first, and think through how you might build it on your own!_
569
+ """)
570
+ return
571
+
572
+
573
+ @app.cell
574
+ def _(airports, alt, flights, usa):
575
+ # interactive selection for origin airport
576
+ # select nearest airport to mouse cursor
577
+ origin = alt.selection_point(
578
+ on='mouseover', nearest=True,
579
+ fields=['origin'], empty='none'
580
+ )
581
+
582
+ # shared data reference for lookup transforms
583
+ foreign = alt.LookupData(data=airports, key='iata',
584
+ fields=['latitude', 'longitude'])
585
+
586
+ alt.layer(
587
+ # base map of the United States
588
+ alt.Chart(alt.topo_feature(usa, 'states')).mark_geoshape(
589
+ fill='#ddd', stroke='#fff', strokeWidth=1
590
+ ),
591
+ # route lines from selected origin airport to destination airports
592
+ alt.Chart(flights).mark_rule(
593
+ color='#000', opacity=0.35
594
+ ).transform_filter(
595
+ origin # filter to selected origin only
596
+ ).transform_lookup(
597
+ lookup='origin', from_=foreign # origin lat/lon
598
+ ).transform_lookup(
599
+ lookup='destination', from_=foreign, as_=['lat2', 'lon2'] # dest lat/lon
600
+ ).encode(
601
+ latitude='latitude:Q',
602
+ longitude='longitude:Q',
603
+ latitude2='lat2',
604
+ longitude2='lon2',
605
+ ),
606
+ # size airports by number of outgoing routes
607
+ # 1. aggregate flights-airport data set
608
+ # 2. lookup location data from airports data set
609
+ # 3. remove Puerto Rico (PR) and Virgin Islands (VI)
610
+ alt.Chart(flights).mark_circle().transform_aggregate(
611
+ groupby=['origin'],
612
+ routes='count()'
613
+ ).transform_lookup(
614
+ lookup='origin',
615
+ from_=alt.LookupData(data=airports, key='iata',
616
+ fields=['state', 'latitude', 'longitude'])
617
+ ).transform_filter(
618
+ 'datum.state !== "PR" && datum.state !== "VI"'
619
+ ).add_params(
620
+ origin
621
+ ).encode(
622
+ latitude='latitude:Q',
623
+ longitude='longitude:Q',
624
+ tooltip=['origin:N', 'routes:Q'],
625
+ size=alt.Size('routes:Q', scale=alt.Scale(range=[0, 1000]), legend=None),
626
+ order=alt.Order('routes:Q', sort='descending') # place smaller circles on top
627
+ )
628
+ ).project(
629
+ type='albersUsa'
630
+ ).properties(
631
+ width=900,
632
+ height=500
633
+ ).configure_view(
634
+ stroke=None
635
+ )
636
+ return
637
+
638
+
639
+ @app.cell(hide_code=True)
640
+ def _(mo):
641
+ mo.md(r"""
642
+ _Mouseover the map to probe the flight network!_
643
+ """)
644
+ return
645
+
646
+
647
+ @app.cell(hide_code=True)
648
+ def _(mo):
649
+ mo.md(r"""
650
+ ## Choropleth Maps
651
+ """)
652
+ return
653
+
654
+
655
+ @app.cell(hide_code=True)
656
+ def _(mo):
657
+ mo.md(r"""
658
+ A [choropleth map](https://en.wikipedia.org/wiki/Choropleth_map) uses shaded or textured regions to visualize data values. Sized symbol maps are often more accurate to read, as people tend to be better at estimating proportional differences between the area of circles than between color shades. Nevertheless, choropleth maps are popular in practice and particularly useful when too many symbols become perceptually overwhelming.
659
+
660
+ For example, while the United States only has 50 states, it has thousands of counties within those states. Let's build a choropleth map of the unemployment rate per county, back in the recession year of 2008. In some cases, input GeoJSON or TopoJSON files might include statistical data that we can directly visualize. In this case, however, we have two files: our TopoJSON file that includes county boundary features (`usa`), and a separate text file that contains unemployment statistics:
661
+ """)
662
+ return
663
+
664
+
665
+ @app.cell
666
+ def _(data):
667
+ unemp = data.unemployment.url
668
+ unemp
669
+ return (unemp,)
670
+
671
+
672
+ @app.cell(hide_code=True)
673
+ def _(mo):
674
+ mo.md(r"""
675
+ To integrate our data sources, we will again need to use the `lookup` transform, augmenting our TopoJSON-based `geoshape` data with unemployment rates. We can then create a map that includes a `color` encoding for the looked-up `rate` field.
676
+ """)
677
+ return
678
+
679
+
680
+ @app.cell
681
+ def _(alt, unemp, usa):
682
+ alt.Chart(alt.topo_feature(usa, 'counties')).mark_geoshape(
683
+ stroke='#aaa', strokeWidth=0.25
684
+ ).transform_lookup(
685
+ lookup='id', from_=alt.LookupData(data=unemp, key='id', fields=['rate'])
686
+ ).encode(
687
+ alt.Color('rate:Q',
688
+ scale=alt.Scale(domain=[0, 0.3], clamp=True),
689
+ legend=alt.Legend(format='%')),
690
+ alt.Tooltip('rate:Q', format='.0%')
691
+ ).project(
692
+ type='albersUsa'
693
+ ).properties(
694
+ width=900,
695
+ height=500
696
+ ).configure_view(
697
+ stroke=None
698
+ )
699
+ return
700
+
701
+
702
+ @app.cell(hide_code=True)
703
+ def _(mo):
704
+ mo.md(r"""
705
+ *Examine the unemployment rates by county. Higher values in Michigan may relate to the automotive industry. Counties in the [Great Plains](https://en.wikipedia.org/wiki/Great_Plains) and Mountain states exhibit both low **and** high rates. Is this variation meaningful, or is it possibly an [artifact of lower sample sizes](https://medium.com/@uwdata/surprise-maps-showing-the-unexpected-e92b67398865)? To explore further, try changing the upper scale domain (e.g., to `0.2`) to adjust the color mapping.*
706
+
707
+ A central concern for choropleth maps is the choice of colors. Above, we use Altair's default `'yellowgreenblue'` scheme for heatmaps. Below we compare other schemes, including a _single-hue sequential_ scheme (`tealblues`) that varies in luminance only, a _multi-hue sequential_ scheme (`viridis`) that ramps in both luminance and hue, and a _diverging_ scheme (`blueorange`) that uses a white mid-point:
708
+ """)
709
+ return
710
+
711
+
712
+ @app.cell
713
+ def _(alt, unemp, usa):
714
+ # utility function to generate a map specification for a provided color scheme
715
+ def map_(scheme):
716
+ return alt.Chart().mark_geoshape().project(type='albersUsa').encode(
717
+ alt.Color('rate:Q', scale=alt.Scale(scheme=scheme), legend=None)
718
+ ).properties(width=305, height=200)
719
+
720
+ alt.hconcat(
721
+ map_('tealblues'), map_('viridis'), map_('blueorange'),
722
+ data=alt.topo_feature(usa, 'counties')
723
+ ).transform_lookup(
724
+ lookup='id', from_=alt.LookupData(data=unemp, key='id', fields=['rate'])
725
+ ).configure_view(
726
+ stroke=None
727
+ ).resolve_scale(
728
+ color='independent'
729
+ )
730
+ return
731
+
732
+
733
+ @app.cell(hide_code=True)
734
+ def _(mo):
735
+ mo.md(r"""
736
+ _Which color schemes do you find to be more effective? Why might that be? Modify the maps above to use other available schemes, as described in the [Vega Color Schemes documentation](https://vega.github.io/vega/docs/schemes/)._
737
+ """)
738
+ return
739
+
740
+
741
+ @app.cell(hide_code=True)
742
+ def _(mo):
743
+ mo.md(r"""
744
+ ## Cartographic Projections
745
+ """)
746
+ return
747
+
748
+
749
+ @app.cell(hide_code=True)
750
+ def _(mo):
751
+ mo.md(r"""
752
+ Now that we have some experience creating maps, let's take a closer look at cartographic projections. As explained by [Wikipedia](https://en.wikipedia.org/wiki/Map_projection),
753
+
754
+ > _All map projections necessarily distort the surface in some fashion. Depending on the purpose of the map, some distortions are acceptable and others are not; therefore, different map projections exist in order to preserve some properties of the sphere-like body at the expense of other properties._
755
+
756
+ Some of the properties we might wish to consider include:
757
+
758
+ - _Area_: Does the projection distort region sizes?
759
+ - _Bearing_: Does a straight line correspond to a constant direction of travel?
760
+ - _Distance_: Do lines of equal length correspond to equal distances on the globe?
761
+ - _Shape_: Does the projection preserve spatial relations (angles) between points?
762
+
763
+ Selecting an appropriate projection thus depends on the use case for the map. For example, if we are assessing land use and the extent of land matters, we might choose an area-preserving projection. If we want to visualize shockwaves emanating from an earthquake, we might focus the map on the quake's epicenter and preserve distances outward from that point. Or, if we wish to aid navigation, the preservation of bearing and shape may be more important.
764
+
765
+ We can also characterize projections in terms of the _projection surface_. Cylindrical projections, for example, project surface points of the sphere onto a surrounding cylinder; the "unrolled" cylinder then provides our map. As we further describe below, we might alternatively project onto the surface of a cone (conic projections) or directly onto a flat plane (azimuthal projections).
766
+
767
+ *Let's first build up our intuition by interacting with a variety of projections! **[Open the online Vega-Lite Cartographic Projections notebook](https://observablehq.com/@vega/vega-lite-cartographic-projections).** Use the controls on that page to select a projection and explore projection parameters, such as the `scale` (zooming) and x/y translation (panning). The rotation ([yaw, pitch, roll](https://en.wikipedia.org/wiki/Aircraft_principal_axes)) controls determine the orientation of the globe relative to the surface being projected upon.*
768
+ """)
769
+ return
770
+
771
+
772
+ @app.cell(hide_code=True)
773
+ def _(mo):
774
+ mo.md(r"""
775
+ ### A Tour of Specific Projection Types
776
+ """)
777
+ return
778
+
779
+
780
+ @app.cell(hide_code=True)
781
+ def _(mo):
782
+ mo.md(r"""
783
+ [**Cylindrical projections**](https://en.wikipedia.org/wiki/Map_projection#Cylindrical) map the sphere onto a surrounding cylinder, then unroll the cylinder. If the major axis of the cylinder is oriented north-south, meridians are mapped to straight lines. [Pseudo-cylindrical](https://en.wikipedia.org/wiki/Map_projection#Pseudocylindrical) projections represent a central meridian as a straight line, with other meridians "bending" away from the center.
784
+ """)
785
+ return
786
+
787
+
788
+ @app.cell
789
+ def _(alt, map):
790
+ _minimap = map.properties(width=225, height=225)
791
+ alt.hconcat(
+     _minimap.project(type='equirectangular').properties(title='equirectangular'),
+     _minimap.project(type='mercator').properties(title='mercator'),
+     _minimap.project(type='transverseMercator').properties(title='transverseMercator'),
+     _minimap.project(type='naturalEarth1').properties(title='naturalEarth1')
+ ).properties(spacing=10).configure_view(stroke=None)
792
+ return
793
+
794
+
795
+ @app.cell(hide_code=True)
796
+ def _(mo):
797
+ mo.md(r"""
798
+ - [Equirectangular](https://en.wikipedia.org/wiki/Equirectangular_projection) (`equirectangular`): Scale `lat`, `lon` coordinate values directly.
799
+ - [Mercator](https://en.wikipedia.org/wiki/Mercator_projection) (`mercator`): Project onto a cylinder, using `lon` directly, but subjecting `lat` to a non-linear transformation. Straight lines preserve constant compass bearings ([rhumb lines](https://en.wikipedia.org/wiki/Rhumb_line)), making this projection well-suited to navigation. However, areas in the far north or south can be greatly distorted.
800
+ - [Transverse Mercator](https://en.wikipedia.org/wiki/Transverse_Mercator_projection) (`transverseMercator`): A mercator projection, but with the bounding cylinder rotated to a transverse axis. Whereas the standard Mercator projection has highest accuracy along the equator, the Transverse Mercator projection is most accurate along the central meridian.
801
+ - [Natural Earth](https://en.wikipedia.org/wiki/Natural_Earth_projection) (`naturalEarth1`): A pseudo-cylindrical projection designed for showing the whole Earth in one view.
802
+ <br/><br/>
803
+ """)
804
+ return
805
+
806
+
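The Mercator bullet above notes that `lat` is subjected to a non-linear transformation. As a quick illustration (a hand-rolled sketch, not part of the notebook's Altair code), that transform is y = ln(tan(π/4 + φ/2)), which is why regions at high latitudes stretch so dramatically:

```python
import math

def mercator_y(lat_deg):
    """Mercator's non-linear latitude transform: y = ln(tan(pi/4 + lat/2))."""
    lat = math.radians(lat_deg)
    return math.log(math.tan(math.pi / 4 + lat / 2))

# The equator maps to y = 0, but y grows rapidly toward the poles,
# inflating the apparent size of far-northern and far-southern regions.
print(round(mercator_y(0), 3))   # 0.0
print(round(mercator_y(60), 3))  # 1.317
print(round(mercator_y(85), 3))  # 3.131
```

The unbounded growth near ±90° is why Mercator world maps are always clipped short of the poles.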
807
+ @app.cell(hide_code=True)
808
+ def _(mo):
809
+ mo.md(r"""
810
+ [**Conic projections**](https://en.wikipedia.org/wiki/Map_projection#Conic) map the sphere onto a cone, and then unroll the cone on to the plane. Conic projections are configured by two _standard parallels_, which determine where the cone intersects the globe.
811
+ """)
812
+ return
813
+
814
+
815
+ @app.cell
816
+ def _(alt, map):
817
+ _minimap = map.properties(width=180, height=130)
818
+ alt.hconcat(
+     _minimap.project(type='conicEqualArea').properties(title='conicEqualArea'),
+     _minimap.project(type='conicEquidistant').properties(title='conicEquidistant'),
+     _minimap.project(type='conicConformal', scale=35, translate=[90, 65]).properties(title='conicConformal'),
+     _minimap.project(type='albers').properties(title='albers'),
+     _minimap.project(type='albersUsa').properties(title='albersUsa')
+ ).properties(spacing=10).configure_view(stroke=None)
819
+ return
820
+
821
+
822
+ @app.cell(hide_code=True)
823
+ def _(mo):
824
+ mo.md(r"""
825
+ - [Conic Equal Area](https://en.wikipedia.org/wiki/Albers_projection) (`conicEqualArea`): Area-preserving conic projection. Shape and distance are not preserved, but roughly accurate within standard parallels.
826
+ - [Conic Equidistant](https://en.wikipedia.org/wiki/Equidistant_conic_projection) (`conicEquidistant`): Conic projection that preserves distance along the meridians and standard parallels.
827
+ - [Conic Conformal](https://en.wikipedia.org/wiki/Lambert_conformal_conic_projection) (`conicConformal`): Conic projection that preserves shape (local angles), but not area or distance.
828
+ - [Albers](https://en.wikipedia.org/wiki/Albers_projection) (`albers`): A variant of the conic equal area projection with standard parallels optimized for creating maps of the United States.
829
+ - [Albers USA](https://en.wikipedia.org/wiki/Albers_projection) (`albersUsa`): A hybrid projection for the 50 states of the United States of America. This projection stitches together three Albers projections with different parameters for the continental U.S., Alaska, and Hawaii.
830
+ <br/><br/>
831
+ """)
832
+ return
833
+
834
+
835
+ @app.cell(hide_code=True)
836
+ def _(mo):
837
+ mo.md(r"""
838
+ [**Azimuthal projections**](https://en.wikipedia.org/wiki/Map_projection#Azimuthal_%28projections_onto_a_plane%29) map the sphere directly onto a plane.
839
+ """)
840
+ return
841
+
842
+
843
+ @app.cell
844
+ def _(alt, map):
845
+ _minimap = map.properties(width=180, height=180)
846
+ alt.hconcat(
+     _minimap.project(type='azimuthalEqualArea').properties(title='azimuthalEqualArea'),
+     _minimap.project(type='azimuthalEquidistant').properties(title='azimuthalEquidistant'),
+     _minimap.project(type='orthographic').properties(title='orthographic'),
+     _minimap.project(type='stereographic').properties(title='stereographic'),
+     _minimap.project(type='gnomonic').properties(title='gnomonic')
+ ).properties(spacing=10).configure_view(stroke=None)
847
+ return
848
+
849
+
850
+ @app.cell(hide_code=True)
851
+ def _(mo):
852
+ mo.md(r"""
853
+ - [Azimuthal Equal Area](https://en.wikipedia.org/wiki/Lambert_azimuthal_equal-area_projection) (`azimuthalEqualArea`): Accurately projects area in all parts of the globe, but does not preserve shape (local angles).
854
+ - [Azimuthal Equidistant](https://en.wikipedia.org/wiki/Azimuthal_equidistant_projection) (`azimuthalEquidistant`): Preserves proportional distance from the projection center to all other points on the globe.
855
+ - [Orthographic](https://en.wikipedia.org/wiki/Orthographic_projection_in_cartography) (`orthographic`): Projects a visible hemisphere onto a distant plane. Approximately matches a view of the Earth from outer space.
856
+ - [Stereographic](https://en.wikipedia.org/wiki/Stereographic_projection) (`stereographic`): Preserves shape, but not area or distance.
857
+ - [Gnomonic](https://en.wikipedia.org/wiki/Gnomonic_projection) (`gnomonic`): Projects the surface of the sphere directly onto a tangent plane. [Great circles](https://en.wikipedia.org/wiki/Great_circle) around the Earth are projected to straight lines, showing the shortest path between points.
858
+ <br/><br/>
859
+ """)
860
+ return
861
+
862
+
863
+ @app.cell(hide_code=True)
864
+ def _(mo):
865
+ mo.md(r"""
866
+ ## Coda: Wrangling Geographic Data
867
+ """)
868
+ return
869
+
870
+
871
+ @app.cell(hide_code=True)
872
+ def _(mo):
873
+ mo.md(r"""
874
+ The examples above all draw from the vega-datasets collection, including geometric (TopoJSON) and tabular (airports, unemployment rates) data. A common challenge in getting started with geographic visualization is collecting the necessary data for your task. Many data providers exist, including services such as the [United States Geological Survey](https://www.usgs.gov/products/data/all-data) and [U.S. Census Bureau](https://www.census.gov/data/datasets.html).
875
+
876
+ In many cases you may have existing data with a geographic component, but require additional measures or geometry. To help you get started, here is one workflow:
877
+
878
+ 1. Visit [Natural Earth Data](http://www.naturalearthdata.com/downloads/) and browse to select data for regions and resolutions of interest. Download the corresponding zip file(s).
879
+ 2. Go to [MapShaper](https://mapshaper.org/) and drop your downloaded zip file onto the page. Revise the data as desired, and then "Export" generated TopoJSON or GeoJSON files.
880
+ 3. Load the exported data from MapShaper for use with Altair!
881
+
882
+ Of course, many other tools &ndash; both open-source and proprietary &ndash; exist for working with geographic data. For more about geo-data wrangling and map creation, see Mike Bostock's tutorial series on [Command-Line Cartography](https://medium.com/@mbostock/command-line-cartography-part-1-897aa8f8ca2c).
883
+ """)
884
+ return
885
+
886
+
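Step 3 of the workflow above can be sketched as follows. This is an illustrative example: the inline dict stands in for a MapShaper export, and the `countries.geojson` filename and `Exampleland` feature are made up:

```python
# Hypothetical sketch: inspect a GeoJSON file exported from MapShaper
# before handing it to Altair. The inline dict below stands in for
# json.load(open("countries.geojson")).
geojson = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {"name": "Exampleland"},
            "geometry": {
                "type": "Polygon",
                "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 0]]],
            },
        },
    ],
}

# Sanity-check the export: how many regions did MapShaper produce?
print(len(geojson["features"]))

# The features can then be used as inline data with Altair, e.g.:
# import altair as alt
# alt.Chart(alt.Data(values=geojson["features"])).mark_geoshape()
```

Checking feature counts and property names before plotting helps catch lookup-key mismatches early.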
887
+ @app.cell(hide_code=True)
888
+ def _(mo):
889
+ mo.md(r"""
890
+ ## Summary
891
+
892
+ At this point, we've only dipped our toes into the waters of map-making. _(You didn't expect a single notebook to impart centuries of learning, did you?)_ For example, we left untouched topics such as [_cartograms_](https://en.wikipedia.org/wiki/Cartogram) and conveying [_topography_](https://en.wikipedia.org/wiki/Topography) &mdash; as in Imhof's illuminating book [_Cartographic Relief Presentation_](https://books.google.com/books?id=cVy1Ms43fFYC). Nevertheless, you should now be well-equipped to create a rich array of geo-visualizations. For more, MacEachren's book [_How Maps Work: Representation, Visualization, and Design_](https://books.google.com/books?id=xhAvN3B0CkUC) provides a valuable overview of map-making from the perspective of data visualization.
893
+ """)
894
+ return
895
+
896
+
897
+ if __name__ == "__main__":
898
+ app.run()
altair/08_debugging.py ADDED
@@ -0,0 +1,370 @@
1
+ # /// script
2
+ # requires-python = ">=3.11"
3
+ # dependencies = [
4
+ # "altair==6.0.0",
5
+ # "marimo",
6
+ # "pandas==3.0.1",
7
+ # "vega_datasets==0.9.0",
8
+ # ]
9
+ # ///
10
+
11
+ import marimo
12
+
13
+ __generated_with = "0.20.4"
14
+ app = marimo.App()
15
+
16
+
17
+ @app.cell
18
+ def _():
19
+ import marimo as mo
20
+
21
+ return (mo,)
22
+
23
+
24
+ @app.cell(hide_code=True)
25
+ def _(mo):
26
+ mo.md(r"""
27
+ # Altair Debugging Guide
28
+
29
+ In this notebook we show you common debugging techniques that you can use if you run into issues with Altair.
30
+
31
+ You can jump to the following sections:
32
+
33
+ * [Installation and Setup](#Installation) when Altair is not installed correctly
34
+ * [Display Issues](#Display-Troubleshooting) when you don't see a chart
35
+ * [Invalid Specifications](#Invalid-Specifications) when you get an error
36
+ * [Properties are Being Ignored](#Properties-are-Being-Ignored) when you don't see any errors or warnings
37
+ * [Asking for Help](#Asking-for-Help) when you get stuck
38
+ * [Reporting Issues](#Reporting-Issues) when you find a bug
39
+
40
+ In addition to this notebook, you might find the [Frequently Asked Questions](https://altair-viz.github.io/user_guide/faq.html) and [Display Troubleshooting](https://altair-viz.github.io/user_guide/troubleshooting.html) guides helpful.
41
+
42
+ _This notebook is part of the [data visualization curriculum](https://github.com/uwdata/visualization-curriculum)._
43
+ """)
44
+ return
45
+
46
+
47
+ @app.cell(hide_code=True)
48
+ def _(mo):
49
+ mo.md(r"""
50
+ ## Installation
51
+ """)
52
+ return
53
+
54
+
55
+ @app.cell(hide_code=True)
56
+ def _(mo):
57
+ mo.md(r"""
58
+ These instructions follow [the Altair documentation](https://altair-viz.github.io/getting_started/installation.html) but focus on some specifics for this series of notebooks.
59
+
60
+ In every notebook, we will import the [Altair](https://github.com/altair-viz/altair) and [Vega Datasets](https://github.com/altair-viz/vega_datasets) packages. If you are running this notebook on [Colab](https://colab.research.google.com), Altair and Vega Datasets should be preinstalled and ready to go. The notebooks in this series are designed for Colab but should also work in Jupyter Lab or the Jupyter Notebook (the notebook requires a bit more setup [described below](#Special-Setup-for-the-Jupyter-Notebook)) but additional packages are required.
61
+
62
+ If you are running in Jupyter Lab or Jupyter Notebooks, you have to install the necessary packages by running the following command in your terminal.
63
+
64
+ ```bash
65
+ pip install altair vega_datasets
66
+ ```
67
+
68
+ Or if you use [Conda](https://conda.io)
69
+
70
+ ```bash
71
+ conda install -c conda-forge altair vega_datasets
72
+ ```
73
+
74
+ You can run command line commands from a code cell by prefixing it with `!`. For example, to install Altair and Vega Datasets with [Pip](https://pip.pypa.io/), you can run the following cell.
75
+ """)
76
+ return
77
+
78
+
79
+ @app.cell
80
+ def _():
81
+ # packages added via marimo's package management: altair vega_datasets
+ # !pip install altair vega_datasets
82
+ return
83
+
84
+
85
+ @app.cell
86
+ def _():
87
+ import altair as alt
88
+ from vega_datasets import data
89
+
90
+ return alt, data
91
+
92
+
93
+ @app.cell(hide_code=True)
94
+ def _(mo):
95
+ mo.md(r"""
96
+ ### Make sure you are Using the Latest Version of Altair
97
+ """)
98
+ return
99
+
100
+
101
+ @app.cell(hide_code=True)
102
+ def _(mo):
103
+ mo.md(r"""
104
+ If you are running into issues with Altair, first make sure that you are running the latest version. To check the version of Altair that you have installed, run the cell below.
105
+ """)
106
+ return
107
+
108
+
109
+ @app.cell
110
+ def _(alt):
111
+ alt.__version__
112
+ return
113
+
114
+
115
+ @app.cell(hide_code=True)
116
+ def _(mo):
117
+ mo.md(r"""
118
+ To check what the latest version of altair is, go to [this page](https://pypi.org/project/altair/) or run the cell below (requires Python 3).
119
+ """)
120
+ return
121
+
122
+
123
+ @app.cell
124
+ def _():
125
+ import urllib.request, json
126
+ with urllib.request.urlopen("https://pypi.org/pypi/altair/json") as url:
127
+ print(json.loads(url.read().decode())['info']['version'])
128
+ return
129
+
130
+
131
+ @app.cell(hide_code=True)
132
+ def _(mo):
133
+ mo.md(r"""
134
+ If you are not running the latest version, you can update it with `pip`. You can update Altair and Vega Datasets by running this command in your terminal.
135
+
136
+ ```
137
+ pip install -U altair vega_datasets
138
+ ```
139
+ """)
140
+ return
141
+
142
+
143
+ @app.cell(hide_code=True)
144
+ def _(mo):
145
+ mo.md(r"""
146
+ ### Try Making a Chart
147
+ """)
148
+ return
149
+
150
+
151
+ @app.cell(hide_code=True)
152
+ def _(mo):
153
+ mo.md(r"""
154
+ Now you can create an Altair chart.
155
+ """)
156
+ return
157
+
158
+
159
+ @app.cell
160
+ def _(alt, data):
161
+ cars = data.cars()
162
+
163
+ alt.Chart(cars).mark_point().encode(
164
+ x='Horsepower',
165
+ y='Displacement',
166
+ color='Origin'
167
+ )
168
+ return (cars,)
169
+
170
+
171
+ @app.cell(hide_code=True)
172
+ def _(mo):
173
+ mo.md(r"""
174
+ ### Special Setup for the Jupyter Notebook
175
+ """)
176
+ return
177
+
178
+
179
+ @app.cell(hide_code=True)
180
+ def _(mo):
181
+ mo.md(r"""
182
+ If you are running in Jupyter Lab, Jupyter Notebook, or Colab (and have a working Internet connection), you should be seeing a chart. If you are running in another environment (or offline), you will need to tell Altair to use a different renderer.
183
+
184
+ To activate a different renderer in a notebook cell:
185
+
186
+ ```python
187
+ # to run in nteract, VSCode, or offline in JupyterLab
188
+ alt.renderers.enable('mimebundle')
189
+
190
+ ```
191
+
192
+ To run offline in Jupyter Notebook you must install an additional dependency, the `vega` package. Run this command in your terminal:
193
+
194
+ ```bash
195
+ pip install vega
196
+ ```
197
+
198
+ Then activate the notebook renderer:
199
+
200
+ ```python
201
+ # to run offline in Jupyter Notebook
202
+ alt.renderers.enable('notebook')
203
+
204
+ ```
205
+
206
+
207
+ These instructions follow [the instructions on the Altair website](https://altair-viz.github.io/getting_started/installation.html#installation-notebook).
208
+ """)
209
+ return
210
+
211
+
212
+ @app.cell(hide_code=True)
213
+ def _(mo):
214
+ mo.md(r"""
215
+ ## Display Troubleshooting
216
+
217
+ If you are having issues with seeing a chart, make sure your setup is correct by following the [debugging instructions above](#Installation). If you are still having issues, follow the [instructions about debugging display issues in the Altair documentation](https://iliatimofeev.github.io/altair-viz.github.io/user_guide/troubleshooting.html).
218
+ """)
219
+ return
220
+
221
+
222
+ @app.cell(hide_code=True)
223
+ def _(mo):
224
+ mo.md(r"""
225
+ ### Non Existent Fields
226
+
227
+ A common error is [accidentally using a field that does not exist](https://iliatimofeev.github.io/altair-viz.github.io/user_guide/troubleshooting.html#plot-displays-but-the-content-is-empty).
228
+ """)
229
+ return
230
+
231
+
232
+ @app.cell
233
+ def _(alt):
234
+ import pandas as pd
235
+
236
+ df = pd.DataFrame({'x': [1, 2, 3],
237
+ 'y': [3, 1, 4]})
238
+
239
+ alt.Chart(df).mark_point().encode(
240
+ x='x:Q',
241
+ y='y:Q',
242
+ color='color:Q' # <-- this field does not exist in the data!
243
+ )
244
+ return (df,)
245
+
246
+
247
+ @app.cell(hide_code=True)
248
+ def _(mo):
249
+ mo.md(r"""
250
+ Check the spelling of your field names and print the data source to confirm that the data and fields exist. For instance, here you see that `color` is not a valid field.
251
+ """)
252
+ return
253
+
254
+
255
+ @app.cell
256
+ def _(df):
257
+ df.head()
258
+ return
259
+
260
+
261
+ @app.cell(hide_code=True)
262
+ def _(mo):
263
+ mo.md(r"""
264
+ ## Invalid Specifications
265
+
266
+ Another common issue is creating an invalid specification and getting an error.
267
+ """)
268
+ return
269
+
270
+
271
+ @app.cell(hide_code=True)
272
+ def _(mo):
273
+ mo.md(r"""
274
+ ### Invalid Properties
275
+
276
+ Altair might show a `SchemaValidationError` or `ValueError`. Read the error message carefully. Usually it will tell you what is going wrong.
277
+ """)
278
+ return
279
+
280
+
281
+ @app.cell(hide_code=True)
282
+ def _(mo):
283
+ mo.md(r"""
284
+ For example, if you forget the mark type, you will see this `SchemaValidationError`.
285
+ """)
286
+ return
287
+
288
+
289
+ @app.cell
290
+ def _(alt, cars):
291
+ alt.Chart(cars).encode(
292
+ y='Horsepower'
293
+ )
294
+ return
295
+
296
+
297
+ @app.cell(hide_code=True)
298
+ def _(mo):
299
+ mo.md(r"""
300
+ Or if you use a non-existent channel, you get a `TypeError`.
301
+ """)
302
+ return
303
+
304
+
305
+ @app.cell
306
+ def _(alt, cars):
307
+ try:
308
+ alt.Chart(cars).mark_point().encode(
309
+ z='Horsepower'
310
+ )
311
+ except TypeError as e:
312
+ print(f"TypeError: {e}")
313
+ return
314
+
315
+
316
+ @app.cell(hide_code=True)
317
+ def _(mo):
318
+ mo.md(r"""
319
+ ## Properties are Being Ignored
320
+
321
+ Altair might ignore a property that you specified. In the chart below, we are using a `text` channel, which is only compatible with `mark_text`. You do not see an error or a warning about this in the notebook. However, the underlying Vega-Lite library will show a warning in the browser console. Press <kbd>Alt</kbd>+<kbd>Cmd</kbd>+<kbd>I</kbd> on Mac or <kbd>Alt</kbd>+<kbd>Ctrl</kbd>+<kbd>I</kbd> on Windows and Linux to open the developer tools and click on the `Console` tab. When you run the example in the cell below, you will see the following warning.
322
+
323
+ ```
324
+ WARN text dropped as it is incompatible with "bar".
325
+ ```
326
+ """)
327
+ return
328
+
329
+
330
+ @app.cell
331
+ def _(alt, cars):
332
+ alt.Chart(cars).mark_bar().encode(
333
+ y='mean(Horsepower)',
334
+ text='mean(Acceleration)'
335
+ )
336
+ return
337
+
338
+
339
+ @app.cell(hide_code=True)
340
+ def _(mo):
341
+ mo.md(r"""
342
+ If you find yourself debugging issues related to Vega-Lite, you can open the chart in the [Vega Editor](https://vega.github.io/editor/) either by clicking on the "Open in Vega Editor" link at the bottom of the chart or in the action menu (click to open) at the top right of a chart. The Vega Editor provides additional debugging tools, but you will be writing Vega-Lite JSON instead of Altair in Python.
343
+
344
+ **Note**: The Vega Editor may be using a newer version of Vega-Lite and so the behavior may vary.
345
+ """)
346
+ return
347
+
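To make that round trip concrete, here is a hand-written minimal Vega-Lite spec of the kind you edit in the Vega Editor; in Altair, the equivalent JSON comes from `chart.to_dict()` or `chart.to_json()` (the data values below are illustrative):

```python
import json

# A minimal Vega-Lite spec: the JSON you would edit in the Vega Editor.
# In Altair you would obtain this with chart.to_dict() / chart.to_json().
spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "data": {"values": [{"x": 1, "y": 3}, {"x": 2, "y": 1}]},
    "mark": "point",
    "encoding": {
        "x": {"field": "x", "type": "quantitative"},
        "y": {"field": "y", "type": "quantitative"},
    },
}

# Paste this output into the Vega Editor to debug outside the notebook.
print(json.dumps(spec, indent=2))
```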
348
+
349
+ @app.cell(hide_code=True)
350
+ def _(mo):
351
+ mo.md(r"""
352
+ ## Asking for Help
353
+
354
+ If you find a problem with Altair and get stuck, you can ask a question on Stack Overflow. Ask your question with the `altair` and `vega-lite` tags. You can find a list of questions people have asked before [here](https://stackoverflow.com/questions/tagged/altair).
355
+ """)
356
+ return
357
+
358
+
359
+ @app.cell(hide_code=True)
360
+ def _(mo):
361
+ mo.md(r"""
362
+ ## Reporting Issues
363
+
364
+ If you find a problem with Altair and believe it is a bug, please [create an issue in the Altair GitHub repo](https://github.com/altair-viz/altair/issues/new) with a description of your problem. If you believe the issue is related to the underlying Vega-Lite library, please [create an issue in the Vega-Lite GitHub repo](https://github.com/vega/vega-lite/issues/new).
365
+ """)
366
+ return
367
+
368
+
369
+ if __name__ == "__main__":
370
+ app.run()
altair/altair_introduction.py.lock ADDED
The diff for this file is too large to render. See raw diff
 
altair/index.md ADDED
@@ -0,0 +1,14 @@
1
+ ---
2
+ title: Learn Altair
3
+ description: >
4
+ Learn the basics of Altair, a declarative statistical visualization library,
5
+ using lessons developed at the University of Washington.
6
+ ---
7
+
8
+ ## Acknowledgments
9
+
10
+ These notebooks were created by Jeffrey Heer, Dominik Moritz, Jake VanderPlas, and Brock Craft
11
+ as part of the [Visualization Curriculum](https://uwdata.github.io/visualization-curriculum/intro.html)
12
+ at the University of Washington.
13
+ Our thanks to the authors for making their work available under an open license:
14
+ if we all share a little, we all get a lot.
assets/styles.css ADDED
@@ -0,0 +1,51 @@
1
+ :root {
2
+ --primary-green: #10B981;
3
+ --dark-green: #047857;
4
+ --light-green: #D1FAE5;
5
+ }
6
+ .bg-primary { background-color: var(--primary-green); }
7
+ .text-primary { color: var(--primary-green); }
8
+ .border-primary { border-color: var(--primary-green); }
9
+ .bg-light { background-color: var(--light-green); }
10
+ .hover-grow { transition: transform 0.2s ease; }
11
+ .hover-grow:hover { transform: scale(1.02); }
12
+ .card-shadow { box-shadow: 0 4px 6px rgba(0, 0, 0, 0.05), 0 1px 3px rgba(0, 0, 0, 0.1); }
13
+
14
+ /* Prose styles for markdown-generated content */
15
+ .prose h1 { font-size: 1.875rem; font-weight: 700; color: #1f2937; margin: 1.5rem 0 0.75rem; }
16
+ .prose h2 { font-size: 1.5rem; font-weight: 700; color: #1f2937; margin: 1.5rem 0 0.75rem; }
17
+ .prose h3 { font-size: 1.25rem; font-weight: 600; color: #1f2937; margin: 1.25rem 0 0.5rem; }
18
+ .prose h4 { font-size: 1.125rem; font-weight: 600; color: #1f2937; margin: 1rem 0 0.5rem; }
19
+ .prose p { color: #4b5563; margin-bottom: 1rem; line-height: 1.75; }
20
+ .prose ul { list-style-type: disc; padding-left: 1.25rem; margin-bottom: 1rem; color: #4b5563; }
21
+ .prose ol { list-style-type: decimal; padding-left: 1.25rem; margin-bottom: 1rem; color: #4b5563; }
22
+ .prose li { margin-bottom: 0.25rem; line-height: 1.75; }
23
+ .prose a { color: var(--primary-green); }
24
+ .prose a:hover { color: var(--dark-green); }
25
+ .prose strong { font-weight: 600; }
26
+ .prose code { font-family: ui-monospace, monospace; font-size: 0.875em;
27
+ background-color: #f3f4f6; padding: 0.1em 0.3em; border-radius: 0.25rem; }
28
+ .prose pre { background-color: #f3f4f6; color: #1f2937; padding: 1rem;
29
+ border-radius: 0.5rem; overflow-x: auto; margin-bottom: 1rem; }
30
+ .prose pre code { background: none; padding: 0; font-size: 0.875rem; color: inherit; }
31
+
32
+ /* Component classes */
33
+ .logo-container { background-color: var(--light-green); padding: 0.25rem; border-radius: 0.5rem; }
34
+ .card-accent { height: 0.5rem; background-color: var(--primary-green); }
35
+ .feature-card { background-color: #ffffff; padding: 1.5rem; border-radius: 0.5rem;
36
+ box-shadow: 0 4px 6px rgba(0, 0, 0, 0.05), 0 1px 3px rgba(0, 0, 0, 0.1); }
37
+ .content-card { background-color: #ffffff; border: 1px solid #e5e7eb; border-radius: 0.5rem;
38
+ overflow: hidden; box-shadow: 0 4px 6px rgba(0, 0, 0, 0.05), 0 1px 3px rgba(0, 0, 0, 0.1); }
39
+ .icon-container { width: 3rem; height: 3rem; background-color: var(--light-green);
40
+ border-radius: 9999px; display: flex; align-items: center;
41
+ justify-content: center; margin-bottom: 1rem; }
42
+
43
+ .link-primary { color: var(--primary-green); }
44
+ .link-primary:hover { color: var(--dark-green); }
45
+
46
+ .btn-primary { background-color: var(--primary-green); color: #ffffff; font-weight: 500;
47
+ border-radius: 0.375rem; transition: background-color 300ms ease-in-out; }
48
+ .btn-primary:hover { background-color: var(--dark-green); }
49
+
50
+ .footer-link { color: #d1d5db; transition: color 300ms ease-in-out; }
51
+ .footer-link:hover { color: #ffffff; }
bin/build.py ADDED
@@ -0,0 +1,93 @@
1
+ #!/usr/bin/env python
2
+ """Generate a static site from Jinja2 templates and lesson data."""
3
+
4
+ import argparse
5
+ import datetime
6
+ import json
7
+ import re
8
+ import shutil
9
+ from pathlib import Path
10
+
11
+ import frontmatter
12
+ import markdown as md
13
+ from jinja2 import Environment, FileSystemLoader
14
+
15
+ from utils import get_notebook_title
16
+
17
+
18
+ def transform_lessons(data: dict, root: Path) -> dict:
19
+ """Transform raw lesson data into template-ready form."""
20
+ for course_id, course in data.items():
21
+ desc = course.get("description", "").strip()
22
+ course["description_html"] = f"<p>{desc}</p>" if desc else ""
23
+ course["notebooks"] = [
24
+ {
25
+ "title": get_notebook_title(root / course_id / nb)
26
+ or re.sub(r"^\d+_", "", nb.replace(".py", "")).replace("_", " ").title(),
27
+ "html_path": f"{course_id}/{nb.replace('.py', '.html')}",
28
+ "local_html_path": nb.replace(".py", ".html"),
29
+ }
30
+ for nb in course.get("notebooks", [])
31
+ ]
32
+ index_md = root / course_id / "index.md"
33
+ post = frontmatter.load(index_md)
34
+ course["body_html"] = md.markdown(post.content, extensions=["fenced_code", "tables"])
35
+ return data
36
+
37
+
38
+ def render(template, path, **kwargs):
39
+ path.parent.mkdir(parents=True, exist_ok=True)
40
+ path.write_text(template.render(**kwargs))
41
+
42
+
43
+ def main():
44
+ parser = argparse.ArgumentParser(description="Generate static site from lesson data")
45
+ parser.add_argument("--root", required=True, help="Project root directory")
46
+ parser.add_argument("--output", required=True, help="Output directory")
47
+ parser.add_argument("--data", required=True, help="Path to lessons JSON file")
48
+ args = parser.parse_args()
49
+
50
+ root = Path(args.root)
51
+ output = Path(args.output)
52
+ output.mkdir(parents=True, exist_ok=True)
53
+
54
+ lessons = transform_lessons(json.loads(Path(args.data).read_text()), root)
55
+ env = Environment(loader=FileSystemLoader(root / "templates"))
56
+ current_year = datetime.date.today().year
57
+
58
+ render(
59
+ env.get_template("index.html"),
60
+ output / "index.html",
61
+ courses=lessons,
62
+ current_year=current_year,
63
+ root_path="",
64
+ )
65
+
66
+ assets_src = root / "assets"
67
+ if assets_src.exists():
68
+ shutil.copytree(assets_src, output / "assets", dirs_exist_ok=True)
69
+
70
+ for course_id, lesson in lessons.items():
71
+ render(
72
+ env.get_template("lesson.html"),
73
+ output / course_id / "index.html",
74
+ lesson=lesson,
75
+ current_year=current_year,
76
+ root_path="../",
77
+ )
78
+
79
+ page_template = env.get_template("page.html")
80
+ for page_src in sorted((root / "pages").glob("*.md")):
81
+ post = frontmatter.load(page_src)
82
+ render(
83
+ page_template,
84
+ output / page_src.stem / "index.html",
85
+ title=post.get("title", page_src.stem),
86
+ body_html=md.markdown(post.content, extensions=["fenced_code", "tables"]),
87
+ current_year=current_year,
88
+ root_path="../",
89
+ )
90
+
91
+
92
+ if __name__ == "__main__":
93
+ main()
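When a notebook lacks an H1 title, `transform_lessons` falls back to deriving one from the filename. A standalone sketch of just that fallback expression (the filenames here are made up for illustration):

```python
import re

def fallback_title(nb: str) -> str:
    # Mirror build.py's fallback: strip the numeric prefix and ".py" extension,
    # then turn underscores into spaces and title-case the result.
    return re.sub(r"^\d+_", "", nb.replace(".py", "")).replace("_", " ").title()

print(fallback_title("02_data_frames.py"))      # Data Frames
print(fallback_title("10_queueing_theory.py"))  # Queueing Theory
```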
{scripts → bin}/check_empty_cells.py RENAMED
@@ -1,4 +1,4 @@
- #!/usr/bin/env python3
+ #!/usr/bin/env python
  """
  Script to detect empty cells in marimo notebooks.
 
@@ -15,7 +15,6 @@ This script will:
  """
 
  import os
- import re
  import sys
  from pathlib import Path
  from typing import List, Tuple
@@ -136,4 +135,4 @@ def main():
 
 
  if __name__ == "__main__":
-     main()
+     main()
bin/check_missing_titles.py ADDED
@@ -0,0 +1,21 @@
+ #!/usr/bin/env python
+ """Report marimo notebooks that are missing an H1 title."""
+
+ import sys
+ from pathlib import Path
+
+ from utils import get_notebook_title
+
+
+ def main():
+     root = Path(__file__).parent.parent
+     notebooks = sorted(root.glob("*/[0-9]*.py"))
+     missing = [nb for nb in notebooks if get_notebook_title(nb) is None]
+     if missing:
+         for nb in missing:
+             print(nb.relative_to(root))
+         sys.exit(1)
+
+
+ if __name__ == "__main__":
+     main()
bin/check_notebook_packages.py ADDED
@@ -0,0 +1,110 @@
+ #!/usr/bin/env python
+ """Check that marimo notebooks in the same lesson directory agree on package versions.
+
+ It is acceptable for different notebooks in a directory to specify different packages,
+ but if two or more notebooks specify the same package, their version constraints must
+ be identical.
+ """
+
+ import argparse
+ import re
+ import sys
+ from collections import defaultdict
+ from pathlib import Path
+
+
+ # Regex to extract the inline script metadata block (PEP 723)
+ SCRIPT_BLOCK_RE = re.compile(r"^# /// script\s*\n((?:#[^\n]*\n)*?)# ///", re.MULTILINE)
+ DEPENDENCY_LINE_RE = re.compile(r'^#\s+"([^"]+)",?\s*$')
+
+
+ def parse_script_header(text: str) -> list[str]:
+     """Return the list of dependency strings from a PEP 723 script header, or []."""
+     match = SCRIPT_BLOCK_RE.search(text)
+     if not match:
+         return []
+     block = match.group(1)
+     deps: list[str] = []
+     in_deps = False
+     for raw_line in block.splitlines():
+         line = raw_line.lstrip("#").strip()
+         if line.startswith("dependencies"):
+             in_deps = True
+             continue
+         if in_deps:
+             if line.startswith("]"):
+                 break
+             # strip surrounding quotes and comma: e.g. ' "polars==1.0",' -> 'polars==1.0'
+             stripped = line.strip().strip('"\'').rstrip(",").strip('"\'')
+             if stripped:
+                 deps.append(stripped)
+     return deps
+
+
+ def package_name(dep: str) -> str:
+     """Extract the bare package name from a PEP 508 dependency string.
+
+     Examples:
+         "polars==1.22.0" -> "polars"
+         "pandas>=2.0,<3" -> "pandas"
+         "marimo" -> "marimo"
+     """
+     return re.split(r"[><=!;\s\[]", dep, maxsplit=1)[0].lower()
+
+
+ def check_directory(lesson_dir: Path, only: set[str]) -> list[str]:
+     """Return a list of error messages for version inconsistencies among *only* in lesson_dir."""
+     # Map package name -> {version_spec: [notebook_path, ...]}
+     seen: dict[str, dict[str, list[str]]] = defaultdict(lambda: defaultdict(list))
+
+     for nb in sorted(lesson_dir.glob("*.py")):
+         if nb.name not in only:
+             continue
+         try:
+             text = nb.read_text(encoding="utf-8")
+         except IOError:
+             continue
+         if "marimo.App" not in text:
+             continue
+         for dep in parse_script_header(text):
+             name = package_name(dep)
+             seen[name][dep].append(nb.name)
+
+     errors: list[str] = []
+     for name, specs in sorted(seen.items()):
+         if len(specs) > 1:
+             errors.append(f"  Package '{name}' has conflicting specifications:")
+             for spec, files in sorted(specs.items()):
+                 errors.append(f"    {spec!r} in: {', '.join(files)}")
+     return errors
+
+
+ def main() -> None:
+     parser = argparse.ArgumentParser(description=__doc__)
+     parser.add_argument("notebooks", nargs="+", metavar="NOTEBOOK",
+                         help="notebook files to check (grouped by directory)")
+     args = parser.parse_args()
+
+     dir_filter: dict[Path, set[str]] = defaultdict(set)
+     for nb_path in (Path(p) for p in args.notebooks):
+         dir_filter[nb_path.parent].add(nb_path.name)
+
+     total_errors = 0
+     for lesson_dir, only in sorted(dir_filter.items()):
+         errors = check_directory(lesson_dir, only=only)
+         if errors:
+             print(f"\n{lesson_dir}/")
+             for msg in errors:
+                 print(msg)
+             total_errors += len(errors)
+
+     if total_errors:
+         print(f"\nFound package version inconsistencies in {total_errors} package(s).")
+         sys.exit(1)
+     else:
+         print("All package version specifications are consistent.")
+         sys.exit(0)
+
+
+ if __name__ == "__main__":
+     main()
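The consistency rule the checker enforces can be seen with a pair of hypothetical notebook headers; `package_name` below repeats the helper from the script so the sketch stands alone:

```python
import re
from collections import defaultdict

def package_name(dep: str) -> str:
    # Bare name from a PEP 508 string: "polars==1.22.0" -> "polars"
    return re.split(r"[><=!;\s\[]", dep, maxsplit=1)[0].lower()

# Hypothetical dependency lists from two notebooks in one lesson directory.
headers = {
    "01_intro.py": ["marimo", "polars==1.22.0"],
    "02_joins.py": ["polars==1.30.0"],
}

# Group every spec by bare package name, remembering which notebook used it.
seen = defaultdict(lambda: defaultdict(list))
for nb, deps in headers.items():
    for dep in deps:
        seen[package_name(dep)][dep].append(nb)

# A package with more than one distinct spec is a conflict.
conflicts = sorted(name for name, specs in seen.items() if len(specs) > 1)
print(conflicts)  # ['polars']
```

`marimo` appears only once, so it is fine; `polars` is pinned two different ways and would make the check exit non-zero.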
bin/create_sql_lab.sql ADDED
@@ -0,0 +1,22 @@
+ create table job (
+     name text not null,
+     credits real not null
+ );
+
+ create table work (
+     person text not null,
+     job text not null
+ );
+
+ insert into job values
+     ('calibrate', 1.5),
+     ('clean', 0.5);
+
+ insert into work values
+     ('Amal', 'calibrate'),
+     ('Amal', 'clean'),
+     ('Amal', 'complain'),
+     ('Gita', 'clean'),
+     ('Gita', 'clean'),
+     ('Gita', 'complain'),
+     ('Madhi', 'complain');
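Loading these statements into an in-memory database shows why the lesson data includes work for a job ('complain') that has no `job` row: an inner join silently drops it, which is the kind of pitfall the SQL tutorial can highlight. A self-contained demo with the schema and rows inlined:

```python
import sqlite3

# Schema and rows from bin/create_sql_lab.sql, inlined for a standalone demo.
SQL = """
create table job (name text not null, credits real not null);
create table work (person text not null, job text not null);
insert into job values ('calibrate', 1.5), ('clean', 0.5);
insert into work values
    ('Amal', 'calibrate'), ('Amal', 'clean'), ('Amal', 'complain'),
    ('Gita', 'clean'), ('Gita', 'clean'), ('Gita', 'complain'),
    ('Madhi', 'complain');
"""

con = sqlite3.connect(":memory:")
con.executescript(SQL)
rows = con.execute(
    "select person, sum(credits) from work join job on work.job = job.name "
    "group by person order by person"
).fetchall()
print(rows)  # [('Amal', 2.0), ('Gita', 1.0)]
```

Madhi disappears entirely because their only job has no match in `job`; a `left join` would keep the row with a NULL credit sum.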
bin/create_sql_penguins.py ADDED
@@ -0,0 +1,50 @@
+ #!/usr/bin/env python
+
+ import csv
+ import sqlite3
+ import sys
+
+
+ SCHEMA = """
+ CREATE TABLE penguins (
+     species text,
+     island text,
+     bill_length_mm real,
+     bill_depth_mm real,
+     flipper_length_mm real,
+     body_mass_g real,
+     sex text
+ );
+ """
+
+ def main():
+     infile = sys.argv[1]
+     outfile = sys.argv[2]
+
+     con = sqlite3.connect(outfile)
+     con.execute(SCHEMA)
+
+     with open(infile, newline="") as f:
+         reader = csv.DictReader(f)
+         rows = [
+             (
+                 row["species"],
+                 row["island"],
+                 float(row["bill_length_mm"]) if row["bill_length_mm"] else None,
+                 float(row["bill_depth_mm"]) if row["bill_depth_mm"] else None,
+                 float(row["flipper_length_mm"]) if row["flipper_length_mm"] else None,
+                 float(row["body_mass_g"]) if row["body_mass_g"] else None,
+                 row["sex"] if row["sex"] else None,
+             )
+             for row in reader
+         ]
+
+     con.executemany(
+         "INSERT INTO penguins VALUES (?, ?, ?, ?, ?, ?, ?)", rows
+     )
+     con.commit()
+     con.close()
+
+
+ if __name__ == "__main__":
+     main()
bin/create_sql_survey.py ADDED
@@ -0,0 +1,175 @@
+ #!/usr/bin/env python
+
+ import datetime
+ import faker
+ import itertools
+ import random
+ import sqlite3
+ import sys
+
+
+ LOCALE = "es"
+
+ NUM_PERSONS = 6
+
+ DATE_START = datetime.date(2025, 9, 1)
+ DATE_END = datetime.date(2025, 12, 31)
+ DATE_DURATION = 7
+
+ NUM_MACHINES = 5
+
+ CREATE_PERSONS = """\
+ create table person(
+     person_id text not null primary key,
+     personal text not null,
+     family text not null,
+     supervisor_id text,
+     foreign key(supervisor_id) references person(person_id)
+ );
+ """
+ INSERT_PERSONS = """\
+ insert into person values (:person_id, :personal, :family, :supervisor_id);
+ """
+
+ CREATE_SURVEYS = """\
+ create table survey(
+     survey_id text not null primary key,
+     person_id text not null,
+     start_date text,
+     end_date text,
+     foreign key(person_id) references person(person_id)
+ );
+ """
+ INSERT_SURVEYS = """\
+ insert into survey values(:survey_id, :person_id, :start, :end);
+ """
+
+ CREATE_MACHINES = """\
+ create table machine(
+     machine_id text not null primary key,
+     machine_type text not null
+ );
+ """
+ INSERT_MACHINES = """\
+ insert into machine values(:machine_id, :machine_type);
+ """
+
+ CREATE_RATINGS = """\
+ create table rating(
+     person_id text not null,
+     machine_id text not null,
+     level integer,
+     foreign key(person_id) references person(person_id),
+     foreign key(machine_id) references machine(machine_id)
+ );
+ """
+ INSERT_RATINGS = """\
+ insert into rating values(:person_id, :machine_id, :level);
+ """
+
+ def main():
+     db_name = sys.argv[1]
+     seed = int(sys.argv[2])
+     random.seed(seed)
+
+     persons_counter = itertools.count()
+     next(persons_counter)
+     persons = gen_persons(NUM_PERSONS, persons_counter)
+
+     supers = gen_persons(int(NUM_PERSONS / 2), persons_counter)
+     for p in persons:
+         p["supervisor_id"] = random.choice(supers)["person_id"]
+     if len(supers) > 1:
+         supers[0]["supervisor_id"] = supers[-1]["person_id"]
+
+     surveys = gen_surveys(persons + supers[0:int(len(supers)/2)])
+     surveys[int(len(surveys)/2)]["start"] = None
+
+     cnx = sqlite3.connect(db_name)
+     cur = cnx.cursor()
+
+     everyone = persons + supers
+     random.shuffle(everyone)
+     cur.execute(CREATE_PERSONS)
+     cur.executemany(INSERT_PERSONS, everyone)
+
+     cur.execute(CREATE_SURVEYS)
+     cur.executemany(INSERT_SURVEYS, surveys)
+
+     machines = gen_machines()
+     cur.execute(CREATE_MACHINES)
+     cur.executemany(INSERT_MACHINES, machines)
+
+     ratings = gen_ratings(everyone, machines)
+     cur.execute(CREATE_RATINGS)
+     cur.executemany(INSERT_RATINGS, ratings)
+
+     cnx.commit()
+     cnx.close()
+
+
+ def gen_machines():
+     adjectives = "hydraulic rotary modular industrial automated".split()
+     nouns = "press conveyor generator actuator compressor".split()
+     machines = set()
+     while len(machines) < NUM_MACHINES:
+         candidate = f"{random.choice(adjectives)} {random.choice(nouns)}"
+         if candidate not in machines:
+             machines.add(candidate)
+     counter = itertools.count()
+     next(counter)
+     return [
+         {"machine_id": f"M{next(counter):04d}", "machine_type": m}
+         for m in machines
+     ]
+
+
+ def gen_persons(num, counter):
+     fake = faker.Faker(LOCALE)
+     fake.seed_instance(random.randint(0, 1_000_000))
+     return [
+         {
+             "person_id": f"P{next(counter):03d}",
+             "personal": fake.first_name(),
+             "family": fake.last_name(),
+             "supervisor_id": None,
+         }
+         for _ in range(num)
+     ]
+
+
+ def gen_ratings(persons, machines):
+     temp = {}
+     while len(temp) < int(len(persons) * len(machines) / 4):
+         p = random.choice(persons)["person_id"]
+         m = random.choice(machines)["machine_id"]
+         if (p, m) in temp:
+             continue
+         temp[(p, m)] = random.choice([None, 1, 2, 3])
+     return [
+         {"person_id": p, "machine_id": m, "level": v}
+         for ((p, m), v) in temp.items()
+     ]
+
+ def gen_surveys(persons):
+     surveys = []
+     counter = itertools.count()
+     next(counter)
+     for person in persons:
+         person_id = person["person_id"]
+         start = DATE_START
+         while start <= DATE_END:
+             survey_id = f"S{next(counter):04d}"
+             end = start + datetime.timedelta(days=random.randint(1, DATE_DURATION))
+             surveys.append({
+                 "survey_id": survey_id,
+                 "person_id": person_id,
+                 "start": start.isoformat(),
+                 "end": end.isoformat() if end <= DATE_END else None
+             })
+             start = end + datetime.timedelta(days=random.randint(1, DATE_DURATION))
+     return surveys
+
+
+ if __name__ == "__main__":
+     main()
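The per-person loop in `gen_surveys` produces intervals that never overlap, and only a survey running past the end date gets a NULL end. A seeded, stdlib-only sketch of just that loop (the seed is arbitrary, chosen only to make the run reproducible):

```python
import datetime
import random

# Constants copied from bin/create_sql_survey.py.
DATE_START = datetime.date(2025, 9, 1)
DATE_END = datetime.date(2025, 12, 31)
DATE_DURATION = 7

random.seed(1234)
intervals = []
start = DATE_START
while start <= DATE_END:
    end = start + datetime.timedelta(days=random.randint(1, DATE_DURATION))
    # Record None for an end date that falls past the survey window.
    intervals.append((start, end if end <= DATE_END else None))
    # Next survey begins at least one day after this one ends.
    start = end + datetime.timedelta(days=random.randint(1, DATE_DURATION))

# Each survey starts strictly after the previous one ends.
assert all(a[1] < b[0] for a, b in zip(intervals, intervals[1:]))
```

Because the next start is computed from the previous end plus a positive offset, a NULL end can only occur on the final interval.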
bin/extract.py ADDED
@@ -0,0 +1,47 @@
+ #!/usr/bin/env python
+ """Extract lesson metadata and notebook lists into a JSON file."""
+
+ import argparse
+ import json
+ import re
+ from pathlib import Path
+
+ import frontmatter
+
+
+ NOTEBOOK_PATTERN = re.compile(r"^\d{2}_.*\.py$")
+
+
+ def extract_lessons(root: Path) -> dict:
+     lessons = {}
+     for index_file in sorted(root.glob("*/index.md")):
+         lesson_dir = index_file.parent
+         post = frontmatter.load(index_file)
+         notebooks = sorted(
+             p.name
+             for p in lesson_dir.glob("*.py")
+             if NOTEBOOK_PATTERN.match(p.name)
+         )
+         lessons[lesson_dir.name] = {
+             **post.metadata,
+             "notebooks": notebooks,
+         }
+     return lessons
+
+
+ def main():
+     parser = argparse.ArgumentParser(description="Extract lesson metadata to JSON")
+     parser.add_argument("--root", required=True, help="Project root directory")
+     parser.add_argument("--data", required=True, help="Output JSON file")
+     args = parser.parse_args()
+
+     root = Path(args.root)
+     data = Path(args.data)
+     data.parent.mkdir(parents=True, exist_ok=True)
+
+     lessons = extract_lessons(root)
+     data.write_text(json.dumps(lessons, indent=2))
+
+
+ if __name__ == "__main__":
+     main()
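`NOTEBOOK_PATTERN` is what enforces the "notebooks with other names are not included" rule: only files named `dd_something.py` (two digits, underscore) count as lesson notebooks. A quick check with made-up filenames:

```python
import re

# Same pattern as bin/extract.py: two digits, underscore, anything, ".py".
NOTEBOOK_PATTERN = re.compile(r"^\d{2}_.*\.py$")

print(bool(NOTEBOOK_PATTERN.match("01_intro.py")))  # True
print(bool(NOTEBOOK_PATTERN.match("scratch.py")))   # False: no numeric prefix
print(bool(NOTEBOOK_PATTERN.match("1_intro.py")))   # False: only one digit
```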
{scripts → bin}/preview.py RENAMED
@@ -1,10 +1,9 @@
- #!/usr/bin/env python3
+ #!/usr/bin/env python
 
  import os
  import subprocess
  import argparse
  import webbrowser
- import time
  import sys
  from pathlib import Path
 
bin/run_notebooks.sh ADDED
@@ -0,0 +1,11 @@
+ #!/usr/bin/env bash
+ for nb in "$@"
+ do
+     cd "$(dirname "$nb")"
+     if ! output=$(uv run "$(basename "$nb")" 2>&1); then
+         echo "=== $nb ==="
+         echo "$output"
+         echo
+     fi
+     cd "$OLDPWD"
+ done
bin/utils.py ADDED
@@ -0,0 +1,14 @@
+ """Utility functions for working with marimo notebooks."""
+
+ import re
+ from pathlib import Path
+
+
+ def get_notebook_title(path: Path) -> str | None:
+     """Return the first H1 Markdown heading in a marimo notebook, or None."""
+     text = path.read_text(encoding="utf-8")
+     for match in re.finditer(r'mo\.md\(r?f?"""(.*?)"""', text, re.DOTALL):
+         for line in match.group(1).splitlines():
+             if line.strip().startswith("# "):
+                 return line.strip()[2:].strip()
+     return None
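The extraction scans each `mo.md("""...""")` (optionally raw or f-string) block for the first line starting with `# `. A sketch that applies the same logic to an in-memory string instead of a file, using a made-up notebook fragment:

```python
import re

def get_h1(text):
    # Same extraction as bin/utils.py get_notebook_title, applied to a string.
    for match in re.finditer(r'mo\.md\(r?f?"""(.*?)"""', text, re.DOTALL):
        for line in match.group(1).splitlines():
            if line.strip().startswith("# "):
                return line.strip()[2:].strip()
    return None

sample = '''
@app.cell
def _(mo):
    mo.md(r"""
    # Queueing Theory
    A tutorial.
    """)
'''
print(get_h1(sample))  # Queueing Theory
```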
daft/README.md DELETED
@@ -1,31 +0,0 @@
- ---
- title: Readme
- marimo-version: 0.18.4
- ---
-
- # Learn Daft
-
- _🚧 This collection is a work in progress. Please help us add notebooks!_
-
- This collection of marimo notebooks is designed to teach you the basics of
- Daft, a distributed dataframe engine that unifies data engineering, analytics & ML/AI workflows.
-
- **Help us build this course! ⚒️**
-
- We're seeking contributors to help us build these notebooks. Every contributor
- will be acknowledged as an author in this README and in their contributed
- notebooks. Head over to the [tracking
- issue](https://github.com/marimo-team/learn/issues/43) to sign up for a planned
- notebook or propose your own.
-
- **Running notebooks.** To run a notebook locally, use
-
- ```bash
- uvx marimo edit <file_url>
- ```
-
- You can also open notebooks in our online playground by appending marimo.app/ to a notebook's URL.
-
- **Thanks to all our notebook authors!**
-
- * [Péter Gyarmati](https://github.com/peter-gy)
daft/_index.md ADDED
@@ -0,0 +1,13 @@
+ ---
+ title: Learn Daft
+ description: >
+   These notebooks introduce Daft, a distributed dataframe engine
+   that unifies data engineering, analysis, and ML/AI workflows.
+ tracking: 43
+ ---
+
+ ## Contributors
+
+ Thanks to our notebook authors:
+
+ * [Péter Gyarmati](https://github.com/peter-gy)
data/penguins.csv ADDED
@@ -0,0 +1,345 @@
+ species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
+ Adelie,Torgersen,39.1,18.7,181,3750,MALE
+ Adelie,Torgersen,39.5,17.4,186,3800,FEMALE
+ Adelie,Torgersen,40.3,18,195,3250,FEMALE
+ Adelie,Torgersen,,,,,
+ Adelie,Torgersen,36.7,19.3,193,3450,FEMALE
+ Adelie,Torgersen,39.3,20.6,190,3650,MALE
+ Adelie,Torgersen,38.9,17.8,181,3625,FEMALE
+ Adelie,Torgersen,39.2,19.6,195,4675,MALE
+ Adelie,Torgersen,34.1,18.1,193,3475,
+ Adelie,Torgersen,42,20.2,190,4250,
+ Adelie,Torgersen,37.8,17.1,186,3300,
+ Adelie,Torgersen,37.8,17.3,180,3700,
+ Adelie,Torgersen,41.1,17.6,182,3200,FEMALE
+ Adelie,Torgersen,38.6,21.2,191,3800,MALE
+ Adelie,Torgersen,34.6,21.1,198,4400,MALE
+ Adelie,Torgersen,36.6,17.8,185,3700,FEMALE
+ Adelie,Torgersen,38.7,19,195,3450,FEMALE
+ Adelie,Torgersen,42.5,20.7,197,4500,MALE
+ Adelie,Torgersen,34.4,18.4,184,3325,FEMALE
+ Adelie,Torgersen,46,21.5,194,4200,MALE
+ Adelie,Biscoe,37.8,18.3,174,3400,FEMALE
+ Adelie,Biscoe,37.7,18.7,180,3600,MALE
+ Adelie,Biscoe,35.9,19.2,189,3800,FEMALE
+ Adelie,Biscoe,38.2,18.1,185,3950,MALE
+ Adelie,Biscoe,38.8,17.2,180,3800,MALE
+ Adelie,Biscoe,35.3,18.9,187,3800,FEMALE
+ Adelie,Biscoe,40.6,18.6,183,3550,MALE
+ Adelie,Biscoe,40.5,17.9,187,3200,FEMALE
+ Adelie,Biscoe,37.9,18.6,172,3150,FEMALE
+ Adelie,Biscoe,40.5,18.9,180,3950,MALE
+ Adelie,Dream,39.5,16.7,178,3250,FEMALE
+ Adelie,Dream,37.2,18.1,178,3900,MALE
+ Adelie,Dream,39.5,17.8,188,3300,FEMALE
+ Adelie,Dream,40.9,18.9,184,3900,MALE
+ Adelie,Dream,36.4,17,195,3325,FEMALE
+ Adelie,Dream,39.2,21.1,196,4150,MALE
+ Adelie,Dream,38.8,20,190,3950,MALE
+ Adelie,Dream,42.2,18.5,180,3550,FEMALE
+ Adelie,Dream,37.6,19.3,181,3300,FEMALE
+ Adelie,Dream,39.8,19.1,184,4650,MALE
+ Adelie,Dream,36.5,18,182,3150,FEMALE
+ Adelie,Dream,40.8,18.4,195,3900,MALE
+ Adelie,Dream,36,18.5,186,3100,FEMALE
+ Adelie,Dream,44.1,19.7,196,4400,MALE
+ Adelie,Dream,37,16.9,185,3000,FEMALE
+ Adelie,Dream,39.6,18.8,190,4600,MALE
+ Adelie,Dream,41.1,19,182,3425,MALE
+ Adelie,Dream,37.5,18.9,179,2975,
+ Adelie,Dream,36,17.9,190,3450,FEMALE
+ Adelie,Dream,42.3,21.2,191,4150,MALE
+ Adelie,Biscoe,39.6,17.7,186,3500,FEMALE
+ Adelie,Biscoe,40.1,18.9,188,4300,MALE
+ Adelie,Biscoe,35,17.9,190,3450,FEMALE
+ Adelie,Biscoe,42,19.5,200,4050,MALE
+ Adelie,Biscoe,34.5,18.1,187,2900,FEMALE
+ Adelie,Biscoe,41.4,18.6,191,3700,MALE
+ Adelie,Biscoe,39,17.5,186,3550,FEMALE
+ Adelie,Biscoe,40.6,18.8,193,3800,MALE
+ Adelie,Biscoe,36.5,16.6,181,2850,FEMALE
+ Adelie,Biscoe,37.6,19.1,194,3750,MALE
+ Adelie,Biscoe,35.7,16.9,185,3150,FEMALE
+ Adelie,Biscoe,41.3,21.1,195,4400,MALE
+ Adelie,Biscoe,37.6,17,185,3600,FEMALE
+ Adelie,Biscoe,41.1,18.2,192,4050,MALE
+ Adelie,Biscoe,36.4,17.1,184,2850,FEMALE
+ Adelie,Biscoe,41.6,18,192,3950,MALE
+ Adelie,Biscoe,35.5,16.2,195,3350,FEMALE
+ Adelie,Biscoe,41.1,19.1,188,4100,MALE
+ Adelie,Torgersen,35.9,16.6,190,3050,FEMALE
+ Adelie,Torgersen,41.8,19.4,198,4450,MALE
+ Adelie,Torgersen,33.5,19,190,3600,FEMALE
+ Adelie,Torgersen,39.7,18.4,190,3900,MALE
+ Adelie,Torgersen,39.6,17.2,196,3550,FEMALE
+ Adelie,Torgersen,45.8,18.9,197,4150,MALE
+ Adelie,Torgersen,35.5,17.5,190,3700,FEMALE
+ Adelie,Torgersen,42.8,18.5,195,4250,MALE
+ Adelie,Torgersen,40.9,16.8,191,3700,FEMALE
+ Adelie,Torgersen,37.2,19.4,184,3900,MALE
+ Adelie,Torgersen,36.2,16.1,187,3550,FEMALE
+ Adelie,Torgersen,42.1,19.1,195,4000,MALE
+ Adelie,Torgersen,34.6,17.2,189,3200,FEMALE
+ Adelie,Torgersen,42.9,17.6,196,4700,MALE
+ Adelie,Torgersen,36.7,18.8,187,3800,FEMALE
+ Adelie,Torgersen,35.1,19.4,193,4200,MALE
+ Adelie,Dream,37.3,17.8,191,3350,FEMALE
+ Adelie,Dream,41.3,20.3,194,3550,MALE
+ Adelie,Dream,36.3,19.5,190,3800,MALE
+ Adelie,Dream,36.9,18.6,189,3500,FEMALE
+ Adelie,Dream,38.3,19.2,189,3950,MALE
+ Adelie,Dream,38.9,18.8,190,3600,FEMALE
+ Adelie,Dream,35.7,18,202,3550,FEMALE
+ Adelie,Dream,41.1,18.1,205,4300,MALE
+ Adelie,Dream,34,17.1,185,3400,FEMALE
+ Adelie,Dream,39.6,18.1,186,4450,MALE
+ Adelie,Dream,36.2,17.3,187,3300,FEMALE
+ Adelie,Dream,40.8,18.9,208,4300,MALE
+ Adelie,Dream,38.1,18.6,190,3700,FEMALE
+ Adelie,Dream,40.3,18.5,196,4350,MALE
+ Adelie,Dream,33.1,16.1,178,2900,FEMALE
+ Adelie,Dream,43.2,18.5,192,4100,MALE
+ Adelie,Biscoe,35,17.9,192,3725,FEMALE
+ Adelie,Biscoe,41,20,203,4725,MALE
+ Adelie,Biscoe,37.7,16,183,3075,FEMALE
+ Adelie,Biscoe,37.8,20,190,4250,MALE
+ Adelie,Biscoe,37.9,18.6,193,2925,FEMALE
+ Adelie,Biscoe,39.7,18.9,184,3550,MALE
+ Adelie,Biscoe,38.6,17.2,199,3750,FEMALE
+ Adelie,Biscoe,38.2,20,190,3900,MALE
+ Adelie,Biscoe,38.1,17,181,3175,FEMALE
+ Adelie,Biscoe,43.2,19,197,4775,MALE
+ Adelie,Biscoe,38.1,16.5,198,3825,FEMALE
+ Adelie,Biscoe,45.6,20.3,191,4600,MALE
+ Adelie,Biscoe,39.7,17.7,193,3200,FEMALE
+ Adelie,Biscoe,42.2,19.5,197,4275,MALE
+ Adelie,Biscoe,39.6,20.7,191,3900,FEMALE
+ Adelie,Biscoe,42.7,18.3,196,4075,MALE
+ Adelie,Torgersen,38.6,17,188,2900,FEMALE
+ Adelie,Torgersen,37.3,20.5,199,3775,MALE
+ Adelie,Torgersen,35.7,17,189,3350,FEMALE
+ Adelie,Torgersen,41.1,18.6,189,3325,MALE
+ Adelie,Torgersen,36.2,17.2,187,3150,FEMALE
+ Adelie,Torgersen,37.7,19.8,198,3500,MALE
+ Adelie,Torgersen,40.2,17,176,3450,FEMALE
+ Adelie,Torgersen,41.4,18.5,202,3875,MALE
+ Adelie,Torgersen,35.2,15.9,186,3050,FEMALE
+ Adelie,Torgersen,40.6,19,199,4000,MALE
+ Adelie,Torgersen,38.8,17.6,191,3275,FEMALE
+ Adelie,Torgersen,41.5,18.3,195,4300,MALE
+ Adelie,Torgersen,39,17.1,191,3050,FEMALE
+ Adelie,Torgersen,44.1,18,210,4000,MALE
+ Adelie,Torgersen,38.5,17.9,190,3325,FEMALE
+ Adelie,Torgersen,43.1,19.2,197,3500,MALE
+ Adelie,Dream,36.8,18.5,193,3500,FEMALE
+ Adelie,Dream,37.5,18.5,199,4475,MALE
+ Adelie,Dream,38.1,17.6,187,3425,FEMALE
+ Adelie,Dream,41.1,17.5,190,3900,MALE
+ Adelie,Dream,35.6,17.5,191,3175,FEMALE
+ Adelie,Dream,40.2,20.1,200,3975,MALE
+ Adelie,Dream,37,16.5,185,3400,FEMALE
+ Adelie,Dream,39.7,17.9,193,4250,MALE
+ Adelie,Dream,40.2,17.1,193,3400,FEMALE
+ Adelie,Dream,40.6,17.2,187,3475,MALE
+ Adelie,Dream,32.1,15.5,188,3050,FEMALE
+ Adelie,Dream,40.7,17,190,3725,MALE
+ Adelie,Dream,37.3,16.8,192,3000,FEMALE
+ Adelie,Dream,39,18.7,185,3650,MALE
+ Adelie,Dream,39.2,18.6,190,4250,MALE
+ Adelie,Dream,36.6,18.4,184,3475,FEMALE
+ Adelie,Dream,36,17.8,195,3450,FEMALE
+ Adelie,Dream,37.8,18.1,193,3750,MALE
+ Adelie,Dream,36,17.1,187,3700,FEMALE
+ Adelie,Dream,41.5,18.5,201,4000,MALE
+ Chinstrap,Dream,46.5,17.9,192,3500,FEMALE
+ Chinstrap,Dream,50,19.5,196,3900,MALE
+ Chinstrap,Dream,51.3,19.2,193,3650,MALE
+ Chinstrap,Dream,45.4,18.7,188,3525,FEMALE
+ Chinstrap,Dream,52.7,19.8,197,3725,MALE
+ Chinstrap,Dream,45.2,17.8,198,3950,FEMALE
+ Chinstrap,Dream,46.1,18.2,178,3250,FEMALE
+ Chinstrap,Dream,51.3,18.2,197,3750,MALE
+ Chinstrap,Dream,46,18.9,195,4150,FEMALE
+ Chinstrap,Dream,51.3,19.9,198,3700,MALE
+ Chinstrap,Dream,46.6,17.8,193,3800,FEMALE
+ Chinstrap,Dream,51.7,20.3,194,3775,MALE
+ Chinstrap,Dream,47,17.3,185,3700,FEMALE
+ Chinstrap,Dream,52,18.1,201,4050,MALE
+ Chinstrap,Dream,45.9,17.1,190,3575,FEMALE
+ Chinstrap,Dream,50.5,19.6,201,4050,MALE
+ Chinstrap,Dream,50.3,20,197,3300,MALE
+ Chinstrap,Dream,58,17.8,181,3700,FEMALE
+ Chinstrap,Dream,46.4,18.6,190,3450,FEMALE
+ Chinstrap,Dream,49.2,18.2,195,4400,MALE
+ Chinstrap,Dream,42.4,17.3,181,3600,FEMALE
+ Chinstrap,Dream,48.5,17.5,191,3400,MALE
+ Chinstrap,Dream,43.2,16.6,187,2900,FEMALE
+ Chinstrap,Dream,50.6,19.4,193,3800,MALE
+ Chinstrap,Dream,46.7,17.9,195,3300,FEMALE
+ Chinstrap,Dream,52,19,197,4150,MALE
+ Chinstrap,Dream,50.5,18.4,200,3400,FEMALE
+ Chinstrap,Dream,49.5,19,200,3800,MALE
+ Chinstrap,Dream,46.4,17.8,191,3700,FEMALE
+ Chinstrap,Dream,52.8,20,205,4550,MALE
+ Chinstrap,Dream,40.9,16.6,187,3200,FEMALE
+ Chinstrap,Dream,54.2,20.8,201,4300,MALE
+ Chinstrap,Dream,42.5,16.7,187,3350,FEMALE
+ Chinstrap,Dream,51,18.8,203,4100,MALE
+ Chinstrap,Dream,49.7,18.6,195,3600,MALE
+ Chinstrap,Dream,47.5,16.8,199,3900,FEMALE
+ Chinstrap,Dream,47.6,18.3,195,3850,FEMALE
+ Chinstrap,Dream,52,20.7,210,4800,MALE
+ Chinstrap,Dream,46.9,16.6,192,2700,FEMALE
+ Chinstrap,Dream,53.5,19.9,205,4500,MALE
+ Chinstrap,Dream,49,19.5,210,3950,MALE
+ Chinstrap,Dream,46.2,17.5,187,3650,FEMALE
+ Chinstrap,Dream,50.9,19.1,196,3550,MALE
+ Chinstrap,Dream,45.5,17,196,3500,FEMALE
+ Chinstrap,Dream,50.9,17.9,196,3675,FEMALE
+ Chinstrap,Dream,50.8,18.5,201,4450,MALE
+ Chinstrap,Dream,50.1,17.9,190,3400,FEMALE
+ Chinstrap,Dream,49,19.6,212,4300,MALE
+ Chinstrap,Dream,51.5,18.7,187,3250,MALE
+ Chinstrap,Dream,49.8,17.3,198,3675,FEMALE
+ Chinstrap,Dream,48.1,16.4,199,3325,FEMALE
+ Chinstrap,Dream,51.4,19,201,3950,MALE
+ Chinstrap,Dream,45.7,17.3,193,3600,FEMALE
+ Chinstrap,Dream,50.7,19.7,203,4050,MALE
+ Chinstrap,Dream,42.5,17.3,187,3350,FEMALE
+ Chinstrap,Dream,52.2,18.8,197,3450,MALE
+ Chinstrap,Dream,45.2,16.6,191,3250,FEMALE
+ Chinstrap,Dream,49.3,19.9,203,4050,MALE
+ Chinstrap,Dream,50.2,18.8,202,3800,MALE
+ Chinstrap,Dream,45.6,19.4,194,3525,FEMALE
+ Chinstrap,Dream,51.9,19.5,206,3950,MALE
+ Chinstrap,Dream,46.8,16.5,189,3650,FEMALE
+ Chinstrap,Dream,45.7,17,195,3650,FEMALE
+ Chinstrap,Dream,55.8,19.8,207,4000,MALE
+ Chinstrap,Dream,43.5,18.1,202,3400,FEMALE
+ Chinstrap,Dream,49.6,18.2,193,3775,MALE
+ Chinstrap,Dream,50.8,19,210,4100,MALE
+ Chinstrap,Dream,50.2,18.7,198,3775,FEMALE
+ Gentoo,Biscoe,46.1,13.2,211,4500,FEMALE
+ Gentoo,Biscoe,50,16.3,230,5700,MALE
+ Gentoo,Biscoe,48.7,14.1,210,4450,FEMALE
+ Gentoo,Biscoe,50,15.2,218,5700,MALE
+ Gentoo,Biscoe,47.6,14.5,215,5400,MALE
+ Gentoo,Biscoe,46.5,13.5,210,4550,FEMALE
+ Gentoo,Biscoe,45.4,14.6,211,4800,FEMALE
+ Gentoo,Biscoe,46.7,15.3,219,5200,MALE
+ Gentoo,Biscoe,43.3,13.4,209,4400,FEMALE
+ Gentoo,Biscoe,46.8,15.4,215,5150,MALE
+ Gentoo,Biscoe,40.9,13.7,214,4650,FEMALE
+ Gentoo,Biscoe,49,16.1,216,5550,MALE
+ Gentoo,Biscoe,45.5,13.7,214,4650,FEMALE
+ Gentoo,Biscoe,48.4,14.6,213,5850,MALE
+ Gentoo,Biscoe,45.8,14.6,210,4200,FEMALE
+ Gentoo,Biscoe,49.3,15.7,217,5850,MALE
+ Gentoo,Biscoe,42,13.5,210,4150,FEMALE
+ Gentoo,Biscoe,49.2,15.2,221,6300,MALE
+ Gentoo,Biscoe,46.2,14.5,209,4800,FEMALE
+ Gentoo,Biscoe,48.7,15.1,222,5350,MALE
+ Gentoo,Biscoe,50.2,14.3,218,5700,MALE
+ Gentoo,Biscoe,45.1,14.5,215,5000,FEMALE
+ Gentoo,Biscoe,46.5,14.5,213,4400,FEMALE
+ Gentoo,Biscoe,46.3,15.8,215,5050,MALE
+ Gentoo,Biscoe,42.9,13.1,215,5000,FEMALE
+ Gentoo,Biscoe,46.1,15.1,215,5100,MALE
+ Gentoo,Biscoe,44.5,14.3,216,4100,
+ Gentoo,Biscoe,47.8,15,215,5650,MALE
+ Gentoo,Biscoe,48.2,14.3,210,4600,FEMALE
+ Gentoo,Biscoe,50,15.3,220,5550,MALE
+ Gentoo,Biscoe,47.3,15.3,222,5250,MALE
+ Gentoo,Biscoe,42.8,14.2,209,4700,FEMALE
+ Gentoo,Biscoe,45.1,14.5,207,5050,FEMALE
+ Gentoo,Biscoe,59.6,17,230,6050,MALE
+ Gentoo,Biscoe,49.1,14.8,220,5150,FEMALE
+ Gentoo,Biscoe,48.4,16.3,220,5400,MALE
+ Gentoo,Biscoe,42.6,13.7,213,4950,FEMALE
+ Gentoo,Biscoe,44.4,17.3,219,5250,MALE
+ Gentoo,Biscoe,44,13.6,208,4350,FEMALE
+ Gentoo,Biscoe,48.7,15.7,208,5350,MALE
+ Gentoo,Biscoe,42.7,13.7,208,3950,FEMALE
+ Gentoo,Biscoe,49.6,16,225,5700,MALE
+ Gentoo,Biscoe,45.3,13.7,210,4300,FEMALE
+ Gentoo,Biscoe,49.6,15,216,4750,MALE
+ Gentoo,Biscoe,50.5,15.9,222,5550,MALE
+ Gentoo,Biscoe,43.6,13.9,217,4900,FEMALE
+ Gentoo,Biscoe,45.5,13.9,210,4200,FEMALE
+ Gentoo,Biscoe,50.5,15.9,225,5400,MALE
+ Gentoo,Biscoe,44.9,13.3,213,5100,FEMALE
+ Gentoo,Biscoe,45.2,15.8,215,5300,MALE
+ Gentoo,Biscoe,46.6,14.2,210,4850,FEMALE
+ Gentoo,Biscoe,48.5,14.1,220,5300,MALE
+ Gentoo,Biscoe,45.1,14.4,210,4400,FEMALE
+ Gentoo,Biscoe,50.1,15,225,5000,MALE
+ Gentoo,Biscoe,46.5,14.4,217,4900,FEMALE
+ Gentoo,Biscoe,45,15.4,220,5050,MALE
+ Gentoo,Biscoe,43.8,13.9,208,4300,FEMALE
+ Gentoo,Biscoe,45.5,15,220,5000,MALE
+ Gentoo,Biscoe,43.2,14.5,208,4450,FEMALE
+ Gentoo,Biscoe,50.4,15.3,224,5550,MALE
+ Gentoo,Biscoe,45.3,13.8,208,4200,FEMALE
+ Gentoo,Biscoe,46.2,14.9,221,5300,MALE
+ Gentoo,Biscoe,45.7,13.9,214,4400,FEMALE
+ Gentoo,Biscoe,54.3,15.7,231,5650,MALE
+ Gentoo,Biscoe,45.8,14.2,219,4700,FEMALE
+ Gentoo,Biscoe,49.8,16.8,230,5700,MALE
+ Gentoo,Biscoe,46.2,14.4,214,4650,
+ Gentoo,Biscoe,49.5,16.2,229,5800,MALE
+ Gentoo,Biscoe,43.5,14.2,220,4700,FEMALE
+ Gentoo,Biscoe,50.7,15,223,5550,MALE
+ Gentoo,Biscoe,47.7,15,216,4750,FEMALE
+ Gentoo,Biscoe,46.4,15.6,221,5000,MALE
+ Gentoo,Biscoe,48.2,15.6,221,5100,MALE
+ Gentoo,Biscoe,46.5,14.8,217,5200,FEMALE
+ Gentoo,Biscoe,46.4,15,216,4700,FEMALE
+ Gentoo,Biscoe,48.6,16,230,5800,MALE
+ Gentoo,Biscoe,47.5,14.2,209,4600,FEMALE
+ Gentoo,Biscoe,51.1,16.3,220,6000,MALE
+ Gentoo,Biscoe,45.2,13.8,215,4750,FEMALE
301
+ Gentoo,Biscoe,45.2,16.4,223,5950,MALE
302
+ Gentoo,Biscoe,49.1,14.5,212,4625,FEMALE
303
+ Gentoo,Biscoe,52.5,15.6,221,5450,MALE
304
+ Gentoo,Biscoe,47.4,14.6,212,4725,FEMALE
305
+ Gentoo,Biscoe,50,15.9,224,5350,MALE
306
+ Gentoo,Biscoe,44.9,13.8,212,4750,FEMALE
307
+ Gentoo,Biscoe,50.8,17.3,228,5600,MALE
308
+ Gentoo,Biscoe,43.4,14.4,218,4600,FEMALE
309
+ Gentoo,Biscoe,51.3,14.2,218,5300,MALE
310
+ Gentoo,Biscoe,47.5,14,212,4875,FEMALE
311
+ Gentoo,Biscoe,52.1,17,230,5550,MALE
312
+ Gentoo,Biscoe,47.5,15,218,4950,FEMALE
313
+ Gentoo,Biscoe,52.2,17.1,228,5400,MALE
314
+ Gentoo,Biscoe,45.5,14.5,212,4750,FEMALE
315
+ Gentoo,Biscoe,49.5,16.1,224,5650,MALE
316
+ Gentoo,Biscoe,44.5,14.7,214,4850,FEMALE
317
+ Gentoo,Biscoe,50.8,15.7,226,5200,MALE
318
+ Gentoo,Biscoe,49.4,15.8,216,4925,MALE
319
+ Gentoo,Biscoe,46.9,14.6,222,4875,FEMALE
320
+ Gentoo,Biscoe,48.4,14.4,203,4625,FEMALE
321
+ Gentoo,Biscoe,51.1,16.5,225,5250,MALE
322
+ Gentoo,Biscoe,48.5,15,219,4850,FEMALE
323
+ Gentoo,Biscoe,55.9,17,228,5600,MALE
324
+ Gentoo,Biscoe,47.2,15.5,215,4975,FEMALE
325
+ Gentoo,Biscoe,49.1,15,228,5500,MALE
326
+ Gentoo,Biscoe,47.3,13.8,216,4725,
327
+ Gentoo,Biscoe,46.8,16.1,215,5500,MALE
328
+ Gentoo,Biscoe,41.7,14.7,210,4700,FEMALE
329
+ Gentoo,Biscoe,53.4,15.8,219,5500,MALE
330
+ Gentoo,Biscoe,43.3,14,208,4575,FEMALE
331
+ Gentoo,Biscoe,48.1,15.1,209,5500,MALE
332
+ Gentoo,Biscoe,50.5,15.2,216,5000,FEMALE
333
+ Gentoo,Biscoe,49.8,15.9,229,5950,MALE
334
+ Gentoo,Biscoe,43.5,15.2,213,4650,FEMALE
335
+ Gentoo,Biscoe,51.5,16.3,230,5500,MALE
336
+ Gentoo,Biscoe,46.2,14.1,217,4375,FEMALE
337
+ Gentoo,Biscoe,55.1,16,230,5850,MALE
338
+ Gentoo,Biscoe,44.5,15.7,217,4875,
339
+ Gentoo,Biscoe,48.8,16.2,222,6000,MALE
340
+ Gentoo,Biscoe,47.2,13.7,214,4925,FEMALE
341
+ Gentoo,Biscoe,,,,,
342
+ Gentoo,Biscoe,46.8,14.3,215,4850,FEMALE
343
+ Gentoo,Biscoe,50.4,15.7,222,5750,MALE
344
+ Gentoo,Biscoe,45.2,14.8,212,5200,FEMALE
345
+ Gentoo,Biscoe,49.9,16.1,213,5400,MALE
duckdb/01_getting_started.py CHANGED
@@ -2,14 +2,13 @@
2
  # requires-python = ">=3.11"
3
  # dependencies = [
4
  # "marimo",
5
- # "duckdb==1.3.2",
6
- # "polars==1.17.1",
7
- # "numpy==2.2.4",
8
- # "pyarrow==19.0.1",
9
- # "pandas==2.2.3",
10
- # "sqlglot==26.12.1",
11
- # "plotly==5.24.1",
12
- # "statsmodels==0.14.4",
13
  # ]
14
  # ///
15
 
@@ -32,9 +31,7 @@ def _(mo):
32
  @app.cell(hide_code=True)
33
  def _(mo):
34
  mo.md(rf"""
35
- # 🦆 **DuckDB**: An Embeddable Analytical Database System
36
-
37
- ## What is DuckDB?
38
 
39
  [DuckDB](https://duckdb.org/) is a _high-performance_, in-process, embeddable SQL OLAP (Online Analytical Processing) Database Management System (DBMS) designed for simplicity and speed. It's essentially a fully-featured database that runs directly within your application's process, without needing a separate server. This makes it excellent for complex analytical workloads, offering a robust SQL interface and efficient processing – perfect for learning about databases and data analysis concepts. It's a great alternative to heavier database systems like PostgreSQL or MySQL when you don't need a full-blown server.
40
 
 
2
  # requires-python = ">=3.11"
3
  # dependencies = [
4
  # "marimo",
5
+ # "duckdb==1.4.4",
6
+ # "numpy==2.4.3",
7
+ # "pandas==2.3.2",
8
+ # "plotly[express]==6.3.0",
9
+ # "polars[pyarrow]==1.24.0",
10
+ # "sqlglot==27.0.0",
11
+ # "statsmodels==0.14.5",
 
12
  # ]
13
  # ///
14
 
 
31
  @app.cell(hide_code=True)
32
  def _(mo):
33
  mo.md(rf"""
34
+ # What is DuckDB?
 
 
35
 
36
  [DuckDB](https://duckdb.org/) is a _high-performance_, in-process, embeddable SQL OLAP (Online Analytical Processing) Database Management System (DBMS) designed for simplicity and speed. It's essentially a fully-featured database that runs directly within your application's process, without needing a separate server. This makes it excellent for complex analytical workloads, offering a robust SQL interface and efficient processing – perfect for learning about databases and data analysis concepts. It's a great alternative to heavier database systems like PostgreSQL or MySQL when you don't need a full-blown server.
37
 
duckdb/{008_loading_parquet.py → 08_loading_parquet.py} RENAMED
@@ -2,9 +2,9 @@
2
  # requires-python = ">=3.10"
3
  # dependencies = [
4
  # "marimo",
5
- # "duckdb==1.3.2",
6
- # "pyarrow==19.0.1",
7
- # "plotly.express",
8
  # "sqlglot==27.0.0",
9
  # ]
10
  # ///
 
2
  # requires-python = ">=3.10"
3
  # dependencies = [
4
  # "marimo",
5
+ # "duckdb==1.4.4",
6
+ # "polars[pyarrow]==1.24.0",
7
+ # "plotly[express]==6.3.0",
8
  # "sqlglot==27.0.0",
9
  # ]
10
  # ///
duckdb/{009_loading_json.py → 09_loading_json.py} RENAMED
@@ -2,9 +2,9 @@
2
  # requires-python = ">=3.11"
3
  # dependencies = [
4
  # "marimo",
5
- # "duckdb==1.3.2",
6
- # "sqlglot==26.11.1",
7
- # "polars[pyarrow]==1.25.2",
8
  # ]
9
  # ///
10
 
 
2
  # requires-python = ">=3.11"
3
  # dependencies = [
4
  # "marimo",
5
+ # "duckdb==1.4.4",
6
+ # "polars[pyarrow]==1.24.0",
7
+ # "sqlglot==27.0.0",
8
  # ]
9
  # ///
10
 
duckdb/{011_working_with_apache_arrow.py → 11_working_with_apache_arrow.py} RENAMED
@@ -2,13 +2,11 @@
2
  # requires-python = ">=3.11"
3
  # dependencies = [
4
  # "marimo",
5
- # "duckdb==1.3.2",
6
- # "pyarrow==19.0.1",
7
- # "polars[pyarrow]==1.25.2",
8
- # "pandas==2.2.3",
9
  # "sqlglot==27.0.0",
10
- # "psutil==7.0.0",
11
- # "altair",
12
  # ]
13
  # ///
14
 
@@ -534,15 +532,8 @@ def _(mo):
534
 
535
 
536
  @app.cell
537
- def _(polars_data, psutil, time):
538
- import os
539
- import pyarrow.compute as pc # Add this import
540
-
541
- # Get current process
542
- process = psutil.Process(os.getpid())
543
-
544
- # Measure memory before operations
545
- memory_before = process.memory_info().rss / 1024 / 1024 # MB
546
 
547
  # Perform multiple Arrow-based operations (zero-copy)
548
  latest_start_time = time.time()
@@ -550,11 +541,9 @@ def _(polars_data, psutil, time):
550
  # These operations use Arrow's zero-copy capabilities
551
  arrow_table = polars_data.to_arrow()
552
  arrow_sliced = arrow_table.slice(0, 100000)
553
- # Use PyArrow compute functions for filtering
554
  arrow_filtered = arrow_table.filter(pc.greater(arrow_table['value'], 500000))
555
 
556
  arrow_ops_time = time.time() - latest_start_time
557
- memory_after_arrow = process.memory_info().rss / 1024 / 1024 # MB
558
 
559
  # Compare with traditional copy-based operations
560
  latest_start_time = time.time()
@@ -565,16 +554,21 @@ def _(polars_data, psutil, time):
565
  pandas_filtered = pandas_copy[pandas_copy['value'] > 500000].copy()
566
 
567
  copy_ops_time = time.time() - latest_start_time
568
- memory_after_copy = process.memory_info().rss / 1024 / 1024 # MB
569
 
570
- print("Memory Usage Comparison:")
571
- print(f"Initial memory: {memory_before:.2f} MB")
572
- print(f"After Arrow operations: {memory_after_arrow:.2f} MB (diff: +{memory_after_arrow - memory_before:.2f} MB)")
573
- print(f"After copy operations: {memory_after_copy:.2f} MB (diff: +{memory_after_copy - memory_before:.2f} MB)")
574
- print(f"\nTime comparison:")
575
- print(f"Arrow operations: {arrow_ops_time:.3f} seconds")
576
- print(f"Copy operations: {copy_ops_time:.3f} seconds")
577
- print(f"Speedup: {copy_ops_time/arrow_ops_time:.1f}x")
 
578
  return
579
 
580
 
@@ -608,8 +602,7 @@ def _():
608
  import pandas as pd
609
  import duckdb
610
  import sqlglot
611
- import psutil
612
- return duckdb, mo, pa, pd, pl, psutil
613
 
614
 
615
  if __name__ == "__main__":
 
2
  # requires-python = ">=3.11"
3
  # dependencies = [
4
  # "marimo",
5
+ # "altair==6.0.0",
6
+ # "duckdb==1.4.4",
7
+ # "pandas==2.3.2",
8
+ # "polars[pyarrow]==1.24.0",
9
  # "sqlglot==27.0.0",
 
 
10
  # ]
11
  # ///
12
 
 
532
 
533
 
534
  @app.cell
535
+ def _(mo, polars_data, time):
536
+ import pyarrow.compute as pc
537
 
538
  # Perform multiple Arrow-based operations (zero-copy)
539
  latest_start_time = time.time()
 
541
  # These operations use Arrow's zero-copy capabilities
542
  arrow_table = polars_data.to_arrow()
543
  arrow_sliced = arrow_table.slice(0, 100000)
 
544
  arrow_filtered = arrow_table.filter(pc.greater(arrow_table['value'], 500000))
545
 
546
  arrow_ops_time = time.time() - latest_start_time
 
547
 
548
  # Compare with traditional copy-based operations
549
  latest_start_time = time.time()
 
554
  pandas_filtered = pandas_copy[pandas_copy['value'] > 500000].copy()
555
 
556
  copy_ops_time = time.time() - latest_start_time
 
557
 
558
+ mo.vstack([
559
+ mo.md(f"""
560
+ **Time comparison:**
561
+
562
+ | Method | Time (s) |
563
+ |--------|----------|
564
+ | Arrow operations | {arrow_ops_time:.3f} |
565
+ | Copy operations | {copy_ops_time:.3f} |
566
+ | Speedup | {copy_ops_time/arrow_ops_time:.1f}x |
567
+
568
+ > **Note:** Memory usage statistics are not available in this environment.
569
+ > Arrow's zero-copy design typically uses 20–40% less memory than Pandas copies.
570
+ """),
571
+ ])
572
  return
573
 
574
 
 
602
  import pandas as pd
603
  import duckdb
604
  import sqlglot
605
+ return duckdb, mo, pa, pd, pl
 
606
 
607
 
608
  if __name__ == "__main__":
duckdb/DuckDB_Loading_CSVs.py CHANGED
@@ -2,12 +2,11 @@
2
  # requires-python = ">=3.10"
3
  # dependencies = [
4
  # "marimo",
5
- # "plotly.express",
6
  # "plotly==6.0.1",
7
- # "duckdb==1.3.2",
8
- # "sqlglot==26.11.1",
9
- # "pyarrow==19.0.1",
10
  # "polars==1.27.1",
 
 
11
  # ]
12
  # ///
13
 
 
2
  # requires-python = ">=3.10"
3
  # dependencies = [
4
  # "marimo",
5
+ # "duckdb==1.4.4",
6
  # "plotly==6.0.1",
 
 
 
7
  # "polars==1.27.1",
8
+ # "pyarrow==19.0.1",
9
+ # "sqlglot==27.0.0",
10
  # ]
11
  # ///
12
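The `# /// script` comment blocks edited throughout this commit are PEP 723 inline script metadata, which tools such as `uv` and marimo read to resolve a notebook's dependencies before running it. The general shape (the dependency list here is illustrative):

```python
# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "marimo",
#     "duckdb==1.4.4",
# ]
# ///
```

Pinning exact versions in these headers is what makes the `check_notebook_packages.py` consistency check meaningful.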
 
duckdb/README.md DELETED
@@ -1,37 +0,0 @@
1
- ---
2
- title: Readme
3
- marimo-version: 0.18.4
4
- ---
5
-
6
- # Learn DuckDB
7
-
8
- _🚧 This collection is a work in progress. Please help us add notebooks!_
9
-
10
- This collection of marimo notebooks is designed to teach you the basics of
11
- DuckDB, a fast in-memory OLAP engine that can interoperate with Dataframes.
12
- These notebooks also show how marimo gives DuckDB superpowers.
13
-
14
- **Help us build this course! ⚒️**
15
-
16
- We're seeking contributors to help us build these notebooks. Every contributor
17
- will be acknowledged as an author in this README and in their contributed
18
- notebooks. Head over to the [tracking
19
- issue](https://github.com/marimo-team/learn/issues/48) to sign up for a planned
20
- notebook or propose your own.
21
-
22
- **Running notebooks.** To run a notebook locally, use
23
-
24
- ```bash
25
- uvx marimo edit <file_url>
26
- ```
27
-
28
- You can also open notebooks in our online playground by appending marimo.app/ to a notebook's URL.
29
-
30
-
31
- **Authors.**
32
-
33
- Thanks to all our notebook authors!
34
-
35
- * [Mustjaab](https://github.com/Mustjaab)
36
- * [julius383](https://github.com/julius383)
37
- * [thliang01](https://github.com/thliang01)
duckdb/index.md ADDED
@@ -0,0 +1,16 @@
1
+ ---
2
+ title: Learn DuckDB
3
+ description: >
4
+ These notebooks teach you the basics of DuckDB,
5
+ a fast in-memory database engine that can interoperate
6
+ with dataframes, and show how marimo gives DuckDB superpowers.
7
+ tracking: 48
8
+ ---
9
+
10
+ ## Contributors
11
+
12
+ Thanks to our notebook authors:
13
+
14
+ * [Mustjaab](https://github.com/Mustjaab)
15
+ * [julius383](https://github.com/julius383)
16
+ * [thliang01](https://github.com/thliang01)
{functional_programming → functional}/05_functors.py RENAMED
@@ -875,7 +875,7 @@ def _(mo):
875
 
876
  @app.cell(hide_code=True)
877
  def _(mo):
878
- mo.md("""
879
  ## Functor laws, again
880
 
881
  Once again there are a few axioms that functors have to obey.
 
875
 
876
  @app.cell(hide_code=True)
877
  def _(mo):
878
+ mo.md(r"""
879
  ## Functor laws, again
880
 
881
  Once again there are a few axioms that functors have to obey.
{functional_programming → functional}/06_applicatives.py RENAMED
@@ -14,7 +14,7 @@ app = marimo.App(app_title="Applicative programming with effects")
14
  @app.cell(hide_code=True)
15
  def _(mo):
16
  mo.md(r"""
17
- # Applicative programming with effects
18
 
19
  `Applicative Functor` encapsulates certain sorts of *effectful* computations in a functionally pure way, and encourages an *applicative* programming style.
20
 
 
14
  @app.cell(hide_code=True)
15
  def _(mo):
16
  mo.md(r"""
17
+ # Applicative Programming with Effects
18
 
19
  `Applicative Functor` encapsulates certain sorts of *effectful* computations in a functionally pure way, and encourages an *applicative* programming style.
20
 
functional/_index.md ADDED
@@ -0,0 +1,25 @@
1
+ ---
2
+ title: Learn Functional Programming
3
+ description: >
4
+ These notebooks introduce powerful ideas from functional programming
5
+ in Python, taking inspiration from Haskell and category theory.
6
+ tracking: 51
7
+ ---
8
+
9
+ Using only Python's standard library, these lessons construct
10
+ functional programming concepts from first principles.
11
+ Topics include:
12
+
13
+ - Currying and higher-order functions
14
+ - Functors, Applicatives, and Monads
15
+ - Category theory fundamentals
16
+
17
+ ## Contributors
18
+
19
+ Thanks to our notebook authors:
20
+
21
+ - métaboulie
22
+
23
+ and reviewers:
24
+
25
+ - [Srihari Thyagarajan](https://github.com/Haleshot)
functional_programming/CHANGELOG.md DELETED
@@ -1,129 +0,0 @@
1
- ---
2
- title: Changelog
3
- marimo-version: 0.18.4
4
- ---
5
-
6
- # Changelog of the functional-programming course
7
-
8
- ## 2025-04-16
9
-
10
- **applicatives.py**
11
-
12
- - replace `return NotImplementedError` with `raise NotImplementedError`
13
-
14
- - add `Either` applicative
15
- - Add `Alternative`
16
-
17
- ## 2025-04-11
18
-
19
- **functors.py**
20
-
21
- - add `Bifunctor` section
22
-
23
- - replace `return NotImplementedError` with `raise NotImplementedError`
24
-
25
- ## 2025-04-08
26
-
27
- **functors.py**
28
-
29
- - restructure the notebook
30
- - replace `f` in the function signatures with `g` to indicate regular functions and
31
- distinguish from functors
32
- move `Maybe` functor to section `More Functor instances`
33
-
34
- - add `Either` functor
35
-
36
- - add `unzip` utility function for functors
37
-
38
- ## 2025-04-07
39
-
40
- **applicatives.py**
41
-
42
- - the `apply` method of `Maybe` _Applicative_ should return `None` when `fg` or `fa` is
43
- `None`
44
-
45
- - add `sequenceL` as a classmethod for `Applicative` and add examples for `Wrapper`,
46
- `Maybe`, `List`
47
- - add description for utility functions of `Applicative`
48
-
49
- - refine the implementation of `IO` _Applicative_
50
- - reimplement `get_chars` with `IO.sequenceL`
51
-
52
- - add an example to show that `ListMonoidal` is equivalent to `List` _Applicative_
53
-
54
- ## 2025-04-06
55
-
56
- **applicatives.py**
57
-
58
- - remove `sequenceL` from `Applicative` because it should be a classmethod but can't be
59
- generically implemented
60
-
61
- ## 2025-04-02
62
-
63
- **functors.py**
64
-
65
- - Migrate to `python3.13`
66
-
67
- - Replace all occurrences of
68
-
69
- ```python
70
- class Functor(Generic[A])
71
- ```
72
-
73
- with
74
-
75
- ```python
76
- class Functor[A]
77
- ```
78
-
79
- for conciseness
80
-
81
- - Use `fa` in function signatures instead of `a` when `fa` is a _Functor_
82
-
83
- **applicatives.py**
84
-
85
- - `0.1.0` version of notebook `06_applicatives.py`
86
-
87
- ## 2025-03-16
88
-
89
- **functors.py**
90
-
91
- - Use uppercased letters for `Generic` types, e.g. `A = TypeVar("A")`
92
- - Refactor the `Functor` class, changing `fmap` and utility methods to `classmethod`
93
-
94
- For example:
95
-
96
- ```python
97
- @dataclass
98
- class Wrapper(Functor, Generic[A]):
99
- value: A
100
-
101
- @classmethod
102
- def fmap(cls, f: Callable[[A], B], a: "Wrapper[A]") -> "Wrapper[B]":
103
- return Wrapper(f(a.value))
104
-
105
- >>> Wrapper.fmap(lambda x: x + 1, wrapper)
106
- Wrapper(value=2)
107
- ```
108
-
109
- - Move the `check_functor_law` method from `Functor` class to a standard function
110
-
111
- - Rename `ListWrapper` to `List` for simplicity
112
- - Remove the `Just` class
113
-
114
- - Rewrite proofs
115
-
116
- ## 2025-03-13
117
-
118
- **functors.py**
119
-
120
- - `0.1.0` version of notebook `05_functors`
121
-
122
- Thank [Akshay](https://github.com/akshayka) and [Haleshot](https://github.com/Haleshot)
123
- for reviewing
124
-
125
- ## 2025-03-11
126
-
127
- **functors.py**
128
-
129
- - Demo version of notebook `05_functors.py`
functional_programming/README.md DELETED
@@ -1,77 +0,0 @@
1
- ---
2
- title: Readme
3
- marimo-version: 0.18.4
4
- ---
5
-
6
- # Learn Functional Programming
7
-
8
- _🚧 This collection is a [work in progress](https://github.com/marimo-team/learn/issues/51)._
9
-
10
- This series of marimo notebooks introduces the powerful paradigm of functional
11
- programming through Python. Taking inspiration from Haskell and Category
12
- Theory, we'll build a strong foundation in FP concepts that can transform how
13
- you approach software development.
14
-
15
- ## What You'll Learn
16
-
17
- **Using only Python's standard library**, we'll construct functional
18
- programming concepts from first principles.
19
-
20
- Topics include:
21
-
22
- + Currying and higher-order functions
23
- + Functors, Applicatives, and Monads
24
- + Category theory fundamentals
25
-
26
- ## Running Notebooks
27
-
28
- ### Locally
29
-
30
- To run a notebook locally, use
31
-
32
- ```bash
33
- uvx marimo edit <URL>
34
- ```
35
-
36
- For example, run the `Functor` tutorial with
37
-
38
- ```bash
39
- uvx marimo edit https://github.com/marimo-team/learn/blob/main/functional_programming/05_functors.py
40
- ```
41
-
42
- ### On Our Online Playground
43
-
44
- You can also open notebooks in our online playground by appending `marimo.app/` to a notebook's URL like:
45
-
46
- https://marimo.app/https://github.com/marimo-team/learn/blob/main/functional_programming/05_functors.py
47
-
48
- ### On Our Landing Page
49
-
50
- Open the notebooks in our landing page page [here](https://marimo-team.github.io/learn/functional_programming/05_functors.html)
51
-
52
- ## Collaboration
53
-
54
- If you're interested in collaborating or have questions, please reach out to me
55
- on Discord (@eugene.hs).
56
-
57
- ## Description of notebooks
58
-
59
- Check [here](https://github.com/marimo-team/learn/issues/51) for current series
60
- structure.
61
-
62
- | Notebook | Title | Key Concepts | Prerequisites |
63
- |----------|-------|--------------|---------------|
64
- | [05. Functors](https://github.com/marimo-team/learn/blob/main/functional_programming/05_functors.py) | Category Theory and Functors | Category Theory, Functor, fmap, Bifunctor | Basic Python, Functions |
65
- | [06. Applicatives](https://github.com/marimo-team/learn/blob/main/functional_programming/06_applicatives.py) | Applicative programming with effects | Applicative Functor, pure, apply, Effectful programming, Alternative | Functors |
66
-
67
- **Authors.**
68
-
69
- Thanks to all our notebook authors!
70
-
71
- - [métaboulie](https://github.com/metaboulie)
72
-
73
- **Reviewers.**
74
-
75
- Thanks to all our notebook reviewers!
76
-
77
- - [Haleshot](https://github.com/Haleshot)
optimization/01_least_squares.py CHANGED
@@ -1,9 +1,9 @@
1
  # /// script
2
  # requires-python = ">=3.11"
3
  # dependencies = [
4
- # "cvxpy==1.6.0",
5
  # "marimo",
6
- # "numpy==2.2.2",
7
  # ]
8
  # ///
9
 
@@ -22,7 +22,7 @@ def _():
22
  @app.cell(hide_code=True)
23
  def _(mo):
24
  mo.md(r"""
25
- # Least squares
26
 
27
  In a least-squares problem, we have measurements $A \in \mathcal{R}^{m \times
28
  n}$ (i.e., $m$ rows and $n$ columns) and $b \in \mathcal{R}^m$. We seek a vector
 
1
  # /// script
2
  # requires-python = ">=3.11"
3
  # dependencies = [
4
+ # "cvxpy-base",
5
  # "marimo",
6
+ # "numpy==2.4.3",
7
  # ]
8
  # ///
9
 
 
22
  @app.cell(hide_code=True)
23
  def _(mo):
24
  mo.md(r"""
25
+ # Least Squares
26
 
27
  In a least-squares problem, we have measurements $A \in \mathcal{R}^{m \times
28
  n}$ (i.e., $m$ rows and $n$ columns) and $b \in \mathcal{R}^m$. We seek a vector
optimization/02_linear_program.py CHANGED
@@ -1,11 +1,11 @@
1
  # /// script
2
  # requires-python = ">=3.13"
3
  # dependencies = [
4
- # "cvxpy==1.6.0",
5
  # "marimo",
6
- # "matplotlib==3.10.0",
7
- # "numpy==2.2.2",
8
- # "wigglystuff==0.1.9",
9
  # ]
10
  # ///
11
 
@@ -24,7 +24,7 @@ def _():
24
  @app.cell(hide_code=True)
25
  def _(mo):
26
  mo.md(r"""
27
- # Linear program
28
 
29
  A linear program is an optimization problem with a linear objective and affine
30
  inequality constraints. A common standard form is the following:
 
1
  # /// script
2
  # requires-python = ">=3.13"
3
  # dependencies = [
4
+ # "cvxpy-base",
5
  # "marimo",
6
+ # "matplotlib==3.10.8",
7
+ # "numpy==2.4.3",
8
+ # "wigglystuff==0.2.37",
9
  # ]
10
  # ///
11
 
 
24
  @app.cell(hide_code=True)
25
  def _(mo):
26
  mo.md(r"""
27
+ # Linear Program
28
 
29
  A linear program is an optimization problem with a linear objective and affine
30
  inequality constraints. A common standard form is the following:
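A defining property of linear programs is that a bounded LP attains its optimum at a vertex of the feasible region. A toy stdlib-only sketch (a hypothetical 2-variable box-constrained problem, not the notebook's cvxpy example) that exploits this by checking the corners directly:

```python
from itertools import product

# Toy LP: maximize x + y subject to 0 <= x <= 2 and 0 <= y <= 3.
# For a box, the vertices are just the four corners.
vertices = list(product([0, 2], [0, 3]))
best = max(vertices, key=lambda v: v[0] + v[1])
print(best, best[0] + best[1])  # → (2, 3) 5
```

Real solvers (simplex, interior-point) scale this vertex idea to many variables and general constraints.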
optimization/03_minimum_fuel_optimal_control.py CHANGED
@@ -1,7 +1,11 @@
1
  # /// script
2
  # requires-python = ">=3.13"
3
  # dependencies = [
 
4
  # "marimo",
 
 
 
5
  # ]
6
  # ///
7
  import marimo
@@ -19,7 +23,7 @@ def _():
19
  @app.cell(hide_code=True)
20
  def _(mo):
21
  mo.md(r"""
22
- # Minimal fuel optimal control
23
 
24
  This notebook includes an application of linear programming to controlling a
25
  physical system, adapted from [Convex
@@ -128,14 +132,14 @@ def _():
128
 
129
 
130
  @app.cell
131
- def _(A, T, b, cp, mo, n, x0, xdes):
132
  X, u = cp.Variable(shape=(n, T + 1)), cp.Variable(shape=(1, T))
133
 
134
  objective = cp.sum(cp.maximum(cp.abs(u), 2 * cp.abs(u) - 1))
135
  constraints = [
136
  X[:, 1:] == A @ X[:, :-1] + b @ u,
137
- X[:, 0] == x0,
138
- X[:, -1] == xdes,
139
  ]
140
 
141
  fuel_used = cp.Problem(cp.Minimize(objective), constraints).solve()
 
1
  # /// script
2
  # requires-python = ">=3.13"
3
  # dependencies = [
4
+ # "cvxpy-base",
5
  # "marimo",
6
+ # "matplotlib==3.10.8",
7
+ # "numpy==2.4.3",
8
+ # "wigglystuff==0.2.37",
9
  # ]
10
  # ///
11
  import marimo
 
23
  @app.cell(hide_code=True)
24
  def _(mo):
25
  mo.md(r"""
26
+ # Minimal Fuel Optimal Control
27
 
28
  This notebook includes an application of linear programming to controlling a
29
  physical system, adapted from [Convex
 
132
 
133
 
134
  @app.cell
135
+ def _(A, T, b, cp, mo, n, np, x0, xdes):
136
  X, u = cp.Variable(shape=(n, T + 1)), cp.Variable(shape=(1, T))
137
 
138
  objective = cp.sum(cp.maximum(cp.abs(u), 2 * cp.abs(u) - 1))
139
  constraints = [
140
  X[:, 1:] == A @ X[:, :-1] + b @ u,
141
+ X[:, 0] == np.array(x0).flatten(),
142
+ X[:, -1] == np.array(xdes).flatten(),
143
  ]
144
 
145
  fuel_used = cp.Problem(cp.Minimize(objective), constraints).solve()
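The `np.array(x0).flatten()` change above guards against the state vectors arriving as nested column-vector lists rather than flat arrays, which would break the elementwise constraint comparison. The core of the fix in plain Python, with a hypothetical `flatten` helper and made-up state values:

```python
def flatten(matrix):
    # Collapse a column vector such as [[0.0], [0.0]] into [0.0, 0.0]
    # so it can be compared elementwise against a flat state vector.
    return [value for row in matrix for value in row]

print(flatten([[0.0], [0.0], [-2.0], [1.0]]))  # → [0.0, 0.0, -2.0, 1.0]
```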
optimization/04_quadratic_program.py CHANGED
@@ -1,11 +1,11 @@
1
  # /// script
2
  # requires-python = ">=3.13"
3
  # dependencies = [
4
- # "cvxpy==1.6.0",
5
  # "marimo",
6
- # "matplotlib==3.10.0",
7
- # "numpy==2.2.2",
8
- # "wigglystuff==0.1.9",
9
  # ]
10
  # ///
11
 
@@ -24,7 +24,7 @@ def _():
24
  @app.cell(hide_code=True)
25
  def _(mo):
26
  mo.md(r"""
27
- # Quadratic program
28
 
29
  A quadratic program is an optimization problem with a quadratic objective and
30
  affine equality and inequality constraints. A common standard form is the
 
1
  # /// script
2
  # requires-python = ">=3.13"
3
  # dependencies = [
4
+ # "cvxpy-base",
5
  # "marimo",
6
+ # "matplotlib==3.10.8",
7
+ # "numpy==2.4.3",
8
+ # "wigglystuff==0.2.37",
9
  # ]
10
  # ///
11
 
 
24
  @app.cell(hide_code=True)
25
  def _(mo):
26
  mo.md(r"""
27
+ # Quadratic Program
28
 
29
  A quadratic program is an optimization problem with a quadratic objective and
30
  affine equality and inequality constraints. A common standard form is the
optimization/05_portfolio_optimization.py CHANGED
@@ -1,12 +1,12 @@
1
  # /// script
2
  # requires-python = ">=3.13"
3
  # dependencies = [
4
- # "cvxpy==1.6.0",
5
  # "marimo",
6
- # "matplotlib==3.10.0",
7
- # "numpy==2.2.2",
8
- # "scipy==1.15.1",
9
- # "wigglystuff==0.1.9",
10
  # ]
11
  # ///
12
 
@@ -25,7 +25,7 @@ def _():
25
  @app.cell(hide_code=True)
26
  def _(mo):
27
  mo.md(r"""
28
- # Portfolio optimization
29
  """)
30
  return
31
 
@@ -145,7 +145,7 @@ def _(mo, np):
145
  def _(mu_widget, np):
146
  np.random.seed(1)
147
  n = 10
148
- mu = np.array(mu_widget.matrix)
149
  Sigma = np.random.randn(n, n)
150
  Sigma = Sigma.T.dot(Sigma)
151
  return Sigma, mu, n
@@ -153,7 +153,7 @@ def _(mu_widget, np):
153
 
154
  @app.cell(hide_code=True)
155
  def _(mo):
156
- mo.md("""
157
  Next, we solve the problem for 100 different values of $\gamma$
158
  """)
159
  return
@@ -187,7 +187,7 @@ def _(cp, gamma, np, prob, ret, risk):
187
 
188
  @app.cell(hide_code=True)
189
  def _(mo):
190
- mo.md("""
191
  Plotted below are the risk return tradeoffs for two values of $\gamma$ (blue squares), and the risk return tradeoffs for investing fully in each asset (red circles)
192
  """)
193
  return
 
1
  # /// script
2
  # requires-python = ">=3.13"
3
  # dependencies = [
4
+ # "cvxpy-base",
5
  # "marimo",
6
+ # "matplotlib==3.10.8",
7
+ # "numpy==2.4.3",
8
+ # "scipy==1.17.1",
9
+ # "wigglystuff==0.2.37",
10
  # ]
11
  # ///
12
 
 
25
  @app.cell(hide_code=True)
26
  def _(mo):
27
  mo.md(r"""
28
+ # Portfolio Optimization
29
  """)
30
  return
31
 
 
145
  def _(mu_widget, np):
146
  np.random.seed(1)
147
  n = 10
148
+ mu = np.array(mu_widget.matrix).flatten()
149
  Sigma = np.random.randn(n, n)
150
  Sigma = Sigma.T.dot(Sigma)
151
  return Sigma, mu, n
 
153
 
154
  @app.cell(hide_code=True)
155
  def _(mo):
156
+ mo.md(r"""
157
  Next, we solve the problem for 100 different values of $\gamma$
158
  """)
159
  return
 
187
 
188
  @app.cell(hide_code=True)
189
  def _(mo):
190
+ mo.md(r"""
191
  Plotted below are the risk return tradeoffs for two values of $\gamma$ (blue squares), and the risk return tradeoffs for investing fully in each asset (red circles)
192
  """)
193
  return
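For context, the risk–return tradeoff swept by $\gamma$ in this notebook is the classical Markowitz portfolio problem: expected return $\mu^T w$ penalized by variance $w^T \Sigma w$, with weights summing to one (the notebook's exact constraints may also include a long-only condition $w \geq 0$):

```latex
\begin{array}{ll}
\text{maximize}   & \mu^T w - \gamma\, w^T \Sigma w \\
\text{subject to} & \mathbf{1}^T w = 1
\end{array}
```

Sweeping $\gamma$ from small to large traces the efficient frontier from high-return/high-risk portfolios to low-risk ones.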
optimization/06_convex_optimization.py CHANGED
@@ -1,9 +1,9 @@
1
  # /// script
2
  # requires-python = ">=3.13"
3
  # dependencies = [
4
- # "cvxpy==1.6.0",
5
  # "marimo",
6
- # "numpy==2.2.2",
7
  # ]
8
  # ///
9
 
@@ -22,7 +22,7 @@ def _():
22
  @app.cell(hide_code=True)
23
  def _(mo):
24
  mo.md(r"""
25
- # Convex optimization
26
 
27
  In the previous tutorials, we learned about least squares, linear programming,
28
  and quadratic programming, and saw applications of each. We also learned that these problem
 
1
  # /// script
2
  # requires-python = ">=3.13"
3
  # dependencies = [
4
+ # "cvxpy-base",
5
  # "marimo",
6
+ # "numpy==2.4.3",
7
  # ]
8
  # ///
9
 
 
22
  @app.cell(hide_code=True)
23
  def _(mo):
24
  mo.md(r"""
25
+ # Convex Optimization
26
 
27
  In the previous tutorials, we learned about least squares, linear programming,
28
  and quadratic programming, and saw applications of each. We also learned that these problem