rtferraz committed on
Commit 0c1ca58 · verified · 1 Parent(s): a239d6e

Phase 2A: Core tokenizer library — schema, field tokenizers, composite builder, predefined schemas, 72 passing tests

Implements the domain tokenizer library following Nubank nuFormer patterns:
- schema.py: DomainSchema, FieldSpec, FieldType (declarative event schema)
- field_tokenizers.py: Sign, MagnitudeBucket, Calendar, Categorical, DiscreteNumerical
- domain_tokenizer.py: DomainTokenizerBuilder (assembles into HF PreTrainedTokenizerFast)
- predefined.py: FINANCE_SCHEMA (97 domain tokens, Nubank-compatible), ECOMMERCE_SCHEMA, HEALTHCARE_SCHEMA
- test_tokenizer.py: 72 tests covering schemas, individual tokenizers, full pipeline, end-to-end encoding
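
The field tokenizer implementations are not part of the diff shown here. As a rough illustration of the idea behind the Sign, MagnitudeBucket, and Calendar tokenizers, the sketch below turns one signed amount and one timestamp into a short list of domain tokens. Every name in it (`sign_token`, `magnitude_bucket_token`, `calendar_tokens`, `tokenize_event`), the bucket edges, and the token spellings are assumptions made for illustration; they are not the library's API or the FINANCE_SCHEMA vocabulary.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical bucket edges; the library's real boundaries are not shown in this commit.
BUCKET_EDGES = [1.0, 10.0, 100.0, 1000.0]


def sign_token(amount: float) -> str:
    """Return a sign token; zero is treated as positive in this sketch."""
    return "[SIGN_NEG]" if amount < 0 else "[SIGN_POS]"


def magnitude_bucket_token(amount: float) -> str:
    """Return a token naming the bucket that the absolute value falls into."""
    idx = bisect_right(BUCKET_EDGES, abs(amount))
    return f"[MAG_{idx}]"


def calendar_tokens(ts: datetime) -> list:
    """Split a timestamp into month, day-of-week (Monday=0), and hour tokens."""
    return [f"[MONTH_{ts.month}]", f"[DOW_{ts.weekday()}]", f"[HOUR_{ts.hour}]"]


def tokenize_event(amount: float, ts: datetime) -> list:
    """Concatenate the per-field tokens for a single event."""
    return [sign_token(amount), magnitude_bucket_token(amount), *calendar_tokens(ts)]


if __name__ == "__main__":
    print(tokenize_event(-42.5, datetime(2024, 3, 15, 9, 30)))
```

Splitting an amount into a sign token and a coarse magnitude token keeps the vocabulary small, at the cost of losing the exact value within a bucket.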

Files changed (1)
  1. src/domain_tokenizer/__init__.py +34 -0
src/domain_tokenizer/__init__.py ADDED
@@ -0,0 +1,34 @@
+ """
+ domainTokenizer — Building small models that understand domain tokens, not just words.
+
+ Core components:
+ - schema: DomainSchema, FieldSpec, FieldType
+ - tokenizers: DomainTokenizerBuilder, per-field tokenizers
+ - schemas: Predefined schemas (FINANCE, ECOMMERCE, HEALTHCARE)
+ """
+
+ from .schema import DomainSchema, FieldSpec, FieldType
+ from .tokenizers.domain_tokenizer import DomainTokenizerBuilder
+ from .tokenizers.field_tokenizers import (
+     BaseFieldTokenizer,
+     CalendarTokenizer,
+     CategoricalTokenizer,
+     DiscreteNumericalTokenizer,
+     MagnitudeBucketTokenizer,
+     SignTokenizer,
+ )
+
+ __version__ = "0.1.0"
+
+ __all__ = [
+     "DomainSchema",
+     "FieldSpec",
+     "FieldType",
+     "DomainTokenizerBuilder",
+     "BaseFieldTokenizer",
+     "SignTokenizer",
+     "MagnitudeBucketTokenizer",
+     "DiscreteNumericalTokenizer",
+     "CalendarTokenizer",
+     "CategoricalTokenizer",
+ ]
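
Because `__all__` is maintained by hand next to the imports, the two can drift apart as tokenizers are added or renamed. A minimal sketch of a check that could catch this is shown below. The helper `missing_exports` and the stand-in module `fake_pkg` are hypothetical and exist only for illustration; in a real test the actual package would be passed in instead.

```python
import types


def missing_exports(module: types.ModuleType) -> list:
    """Return names listed in module.__all__ that the module does not define."""
    return [name for name in getattr(module, "__all__", []) if not hasattr(module, name)]


# Stand-in module for illustration only; this is not the real domain_tokenizer package.
fake = types.ModuleType("fake_pkg")
fake.DomainSchema = object
fake.__all__ = ["DomainSchema", "FieldSpec"]  # FieldSpec is listed but never defined

if __name__ == "__main__":
    print(missing_exports(fake))
```

An empty result means every exported name resolves, so `from package import *` would not raise `AttributeError`.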