# PhD Research OS: Complete System Design
## Version 2.0 | Post-Audit Architecture

**Date**: 2026-04-23
**Status**: DESIGN COMPLETE, ready for phased implementation
**Addresses**: All 87 blindspots from the audit
**Hardware Target**: 16-24GB VRAM consumer GPU (RTX 4090 / RTX 3090 / A6000)

---

## 1. System Overview

```
╔══════════════════════════════════════════════════════════════════════════╗
║                    PhD Research OS v2.0                                  ║
║                    "The Epistemic Engine"                                ║
╠══════════════════════════════════════════════════════════════════════════╣
║                                                                          ║
║  ┌─── INPUTS ───────────────────────────────────────────────────────┐    ║
║  │  PDF Bundles │ Supplements │ Datasets │ Code Repos │ Lab Notes   │    ║
║  └───────────────────────────┬──────────────────────────────────────┘    ║
║                              ▼                                           ║
║  ┌─── LAYER 0: STRUCTURAL INGESTION ────────────────────────────────┐    ║
║  │  Marker → Nougat → GROBID │ Region Classifier │ Plot Digitizer   │    ║
║  │  Section-aware chunks │ Bounding boxes │ Quality scores          │    ║
║  └───────────────────────────┬──────────────────────────────────────┘    ║
║                              ▼                                           ║
║  ┌─── LAYER 1: ENTITY RESOLUTION ───────────────────────────────────┐    ║
║  │  Ontology normalizer │ Citation resolver │ VoR lineage │ Retract.│    ║
║  └───────────────────────────┬──────────────────────────────────────┘    ║
║                              ▼                                           ║
║  ┌─── LAYER 2: QUALIFIED EXTRACTION ────────────────────────────────┐    ║
║  │  AI Model Council (parallel) │ Epistemic Separation Engine       │    ║
║  │  Qualifier preservation │ Statistical extraction │ OOD gating    │    ║
║  │  Guidance constrained decoding │ Source quotes + bboxes          │    ║
║  └───────────────────────────┬──────────────────────────────────────┘    ║
║                              ▼                                           ║
║  ┌─── LAYER 3: CANONICALIZATION ────────────────────────────────────┐    ║
║  │  Embedding dedup │ Canonical registry │ Alias merging            │    ║
║  │  Evidence aggregation │ Temporal versioning │ Lineage diff       │    ║
║  └───────────────────────────┬──────────────────────────────────────┘    ║
║                              ▼                                           ║
║  ┌─── LAYER 4: KNOWLEDGE GRAPH ─────────────────────────────────────┐    ║
║  │  SQLite-backed graph │ Typed epistemic edges │ Lab lineage       │    ║
║  │  Method compatibility │ Transitive constraints │ Gap analysis    │    ║
║  │  Null evidence │ Conflict clustering │ Versioned ontology        │    ║
║  └───────────────────────────┬──────────────────────────────────────┘    ║
║                              ▼                                           ║
║  ┌─── LAYER 5: CALIBRATED SCORING ──────────────────────────────────┐    ║
║  │  Code-computed confidence │ 3 separate scores │ Statistical gate │    ║
║  │  Parser confidence propagation │ Section modifiers │ Brier mon.  │    ║
║  └───────────────────────────┬──────────────────────────────────────┘    ║
║                              ▼                                           ║
║  ┌─── LAYER 6: EVALUATION ──────────────────────────────────────────┐    ║
║  │  LLM-as-Judge CI/CD │ Versioned golden set │ Stochastic tests    │    ║
║  │  Hidden holdout │ Fatigue management │ Counter-metrics           │    ║
║  └───────────────────────────┬──────────────────────────────────────┘    ║
║                              ▼                                           ║
║  ┌─── LAYER 7: PROVENANCE & REPRODUCIBILITY ────────────────────────┐    ║
║  │  Version pinning │ Output lineage │ PDF.js viewer │ Containers   │    ║
║  │  Security sandbox │ License checking │ Epistemic Embargo         │    ║
║  └───────────────────────────┬──────────────────────────────────────┘    ║
║                              ▼                                           ║
║  ┌─── OUTPUTS ──────────────────────────────────────────────────────┐    ║
║  │  Obsidian Vault │ Courtroom UI │ Gap Analysis │ Decision Objects │    ║
║  └──────────────────────────────────────────────────────────────────┘    ║
║                                                                          ║
║  ┌─── CROSS-CUTTING ────────────────────────────────────────────────┐    ║
║  │  AI Model Council │ Meta-Improver │ Superpowers Skills           │    ║
║  │  ECC Harness │ Companion Agents │ Manual Synthesis Mode          │    ║
║  └──────────────────────────────────────────────────────────────────┘    ║
╚══════════════════════════════════════════════════════════════════════════╝
```

---

## 2. Model Architecture

### 2.1 The Two-Model Strategy

The system runs TWO models, not one. This solves the local-vs-online tension:

```
┌──────────────────────────────────────────────────────────┐
│  PRIMARY BRAIN (Fully Local, Never Touches Internet)     │
│                                                          │
│  Model: Qwen3-8B Q4 AWQ                                  │
│  VRAM: ~5GB weights + ~4GB KV cache (PolarQuant)         │
│  Total: ~9GB (fits 16GB GPU with room for batch)         │
│  Context: 128K tokens (full paper length)                │
│  Serving: Ollama (simplest) or vLLM (fastest)            │
│                                                          │
│  Tasks:                                                  │
│  • Claim extraction (Layer 2)                            │
│  • Epistemic classification                              │
│  • Confidence component estimation                       │
│  • Conflict hypothesis generation                        │
│  • Query decomposition                                   │
│  • Decision object generation                            │
│                                                          │
│  Constrained decoding: Guidance engine                   │
│  Training: SFT → DPO → GRPO (4-stage pipeline)           │
│  Privacy: ALL paper data stays local                     │
├──────────────────────────────────────────────────────────┤
│  COMPANION BRAIN (Online, For Non-Sensitive Tasks)       │
│                                                          │
│  Model: Claude API / GPT-4o-mini / OpenRouter            │
│  OR: Local Qwen3-30B-A3B MoE Q4 (~6GB, 3B active)        │
│                                                          │
│  Tasks:                                                  │
│  • Meta-Improver external scanning (arXiv, GitHub)       │
│  • Prompt optimization A/B testing                       │
│  • Training data generation for new domains              │
│  • Retraction/correction checking (needs internet)       │
│  • Repository URL validation                             │
│                                                          │
│  Privacy: NEVER sees raw paper text                      │
│  Only receives: metadata, queries, anonymized claims     │
└──────────────────────────────────────────────────────────┘
```
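
The memory budget in the diagram can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: the layer count, KV-head count, and head dimension are assumed values for an 8B-class model with grouped-query attention, not figures taken from this design, and `kv_bits=4` merely stands in for whatever compression the KV-cache quantizer achieves.

```python
def weight_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes), ignoring overhead."""
    return n_params * bits_per_param / 8 / 1e9


def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                kv_bits: float) -> float:
    """Approximate KV-cache memory in GB for one sequence.

    Each token stores one key and one value vector per layer and KV head.
    """
    values_per_token = 2 * layers * kv_heads * head_dim
    return tokens * values_per_token * kv_bits / 8 / 1e9


# Assumed (hypothetical) architecture numbers for an 8B-class model.
LAYERS, KV_HEADS, HEAD_DIM = 36, 8, 128

weights = weight_gb(8e9, 4)  # 4-bit weights
cache_fp16 = kv_cache_gb(128_000, LAYERS, KV_HEADS, HEAD_DIM, kv_bits=16)
cache_4bit = kv_cache_gb(128_000, LAYERS, KV_HEADS, HEAD_DIM, kv_bits=4)

print(f"weights:       {weights:.1f} GB")
print(f"KV cache fp16: {cache_fp16:.1f} GB")
print(f"KV cache 4bit: {cache_4bit:.1f} GB")
```

Under these assumptions, a full 128K-token context stays near the stated budget only if the KV cache is stored at roughly 4 bits per value; an uncompressed fp16 cache would be about four times larger.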

### 2.2 Why Qwen3-8B, Not Qwen2.5-3B

| Metric | Qwen2.5-3B | Qwen3-8B | Improvement |
|--------|-----------|----------|-------------|
| AIME (math reasoning) | ~15% | ~45%+ | 3× |
| MATH-500 | ~85% | ~95%+ | +10 pts |
| JSON structural accuracy (SFT) | ~65% | ~80%+ | +15 pts |
| Context window | 32K | 128K | 4× |
| Hybrid thinking mode | No | Yes | New capability |
| VRAM at Q4 AWQ | ~2.5GB | ~5GB | Acceptable |

### 2.3 Alternative: Qwen3-30B-A3B MoE (The Stealth Option)

For users with 8GB+ VRAM who want maximum quality:
- 30B total parameters, only 3B activated per token (Mixture of Experts)
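- Roughly 15GB of weights at Q4 quantization (30B parameters at 4 bits each); the ~6GB VRAM figure is only reachable if inactive experts are offloaded to system RAM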
- Quality equivalent to dense 14B+ models
- Apache 2.0 license
- Available: `Qwen/Qwen3-30B-A3B-Instruct-2507` (1M downloads)

### 2.4 Multimodal: Qwen3-VL-8B-Instruct

For figure/diagram processing (Layer 0):
- Same architecture as text model but with vision encoder
- Available: `Qwen/Qwen3-VL-8B-Instruct` (3.9M downloads)
- AWQ 4-bit: `cyankiwi/Qwen3-VL-8B-Instruct-AWQ-4bit` (~5GB)
- Handles: figure classification, diagram understanding, micrograph analysis
- Does NOT replace plot digitizer for quantitative data

### 2.5 VLM for Multimodal Figures: Qwen3-VL-30B-A3B-Instruct

For maximum figure understanding with MoE efficiency:
- Available: `Qwen/Qwen3-VL-30B-A3B-Instruct` (1.5M downloads)
- AWQ: `QuantTrio/Qwen3-VL-30B-A3B-Instruct-AWQ` (667K downloads)
- Only 3B active params, so it fits alongside the primary brain

---

## 3. Training Pipeline (4-Stage)

### Stage 1: SFT on Domain Data

```python
# Current implementation (train.py): keep, but upgrade the base model
from trl import SFTConfig, SFTTrainer
from peft import LoraConfig

trainer = SFTTrainer(
    model="Qwen/Qwen3-8B",  # Upgraded from Qwen2.5-3B
    args=SFTConfig(
        output_dir="./research-os-sft",
        num_train_epochs=3,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=2e-4,
        max_length=4096,  # Longer for paper sections
        assistant_only_loss=True,
        bf16=True,
        gradient_checkpointing=True,
        push_to_hub=True,
        hub_model_id="nkshirsa/phd-research-os-brain-v2",
    ),
    train_dataset=expanded_dataset,  # 10K+ examples (up from 1,900)
    peft_config=LoraConfig(r=64, lora_alpha=16, target_modules="all-linear"),
)
trainer.train()
```

### Stage 2: DPO on Preference Pairs

```python
from peft import LoraConfig
from trl import DPOConfig, DPOTrainer

# Dataset: pairs of (correct extraction, incorrect extraction) for the same text
trainer = DPOTrainer(
    model="./research-os-sft",  # From stage 1
    args=DPOConfig(
        output_dir="./research-os-dpo",
        learning_rate=5e-7,
        num_train_epochs=1,
        max_length=4096,
        bf16=True,
        push_to_hub=True,
    ),
    train_dataset=preference_dataset,  # Placeholder: must be defined before this call
    peft_config=LoraConfig(r=64, target_modules="all-linear"),
)
trainer.train()
```
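
A preference dataset for this stage pairs one prompt with a preferred and a dispreferred completion. The sketch below builds one such record in the `prompt` / `chosen` / `rejected` layout that TRL's `DPOTrainer` accepts; the helper name `make_preference_pair`, the prompt wording, and the example sentence are hypothetical.

```python
import json


def make_preference_pair(source_text: str, good_claim: dict, bad_claim: dict) -> dict:
    """Build one DPO record: same prompt, preferred vs. dispreferred extraction."""
    prompt = f"Extract claims from the following text as JSON.\n\nText: {source_text}"
    return {
        "prompt": prompt,
        "chosen": json.dumps([good_claim]),
        "rejected": json.dumps([bad_claim]),
    }


source = "The treatment may reduce binding affinity (p = 0.08)."
good = {"text": "The treatment may reduce binding affinity.",
        "epistemic_tag": "Hypothesis", "confidence": 0.4,
        "missing_fields": [], "status": "Complete"}
# The rejected sample drops the hedge and over-claims.
bad = {"text": "The treatment reduces binding affinity.",
       "epistemic_tag": "Fact", "confidence": 0.9,
       "missing_fields": [], "status": "Complete"}

pair = make_preference_pair(source, good, bad)
print(sorted(pair))
```

Pairs like this one, where the rejected output differs only in dropped hedging or an inflated tag, target the failure modes the later reward functions also penalize.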

### Stage 3: GRPO with Epistemic Reward Functions

This is the critical stage that bakes JSON reliability and epistemic correctness into the model:

```python
import json

from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

# ── Reward Function 1: JSON Validity ──
def json_validity_reward(completions, **kwargs):
    """Binary reward: is the output valid JSON?"""
    rewards = []
    for completion in completions:
        content = completion[0]["content"] if isinstance(completion, list) else completion
        try:
            json.loads(content)
            rewards.append(1.0)
        except (json.JSONDecodeError, TypeError):
            rewards.append(0.0)
    return rewards

# ── Reward Function 2: Schema Compliance ──
REQUIRED_KEYS = {"text", "epistemic_tag", "confidence", "missing_fields", "status"}
VALID_TAGS = {"Fact", "Interpretation", "Hypothesis", "Conflict_Hypothesis"}

def schema_compliance_reward(completions, **kwargs):
    """Reward for matching the Research OS claim schema."""
    rewards = []
    for completion in completions:
        content = completion[0]["content"] if isinstance(completion, list) else completion
        score = 0.0
        try:
            data = json.loads(content)
            claims = data if isinstance(data, list) else data.get("claims", [data])
            
            for claim in claims:
                if not isinstance(claim, dict):
                    continue
                # Key presence: 0.3
                present_keys = set(claim.keys()) & REQUIRED_KEYS
                score += 0.3 * len(present_keys) / len(REQUIRED_KEYS)
                # Valid epistemic tag: 0.3
                if claim.get("epistemic_tag") in VALID_TAGS:
                    score += 0.3
                # Confidence in range: 0.2
                conf = claim.get("confidence", -1)
                if isinstance(conf, (int, float)) and 0 <= conf <= 1:
                    score += 0.2
                # Status consistency: 0.2
                missing = claim.get("missing_fields", [])
                status = claim.get("status", "")
                if (missing and status == "Incomplete") or (not missing and status == "Complete"):
                    score += 0.2
            
            if claims:
                score /= len(claims)
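        except (json.JSONDecodeError, TypeError, AttributeError):
            score = 0.0  # Malformed output earns no partial credit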
        rewards.append(score)
    return rewards

# ── Reward Function 3: Qualifier Preservation ──
HEDGING_WORDS = {"may", "might", "could", "suggests", "possibly", "potentially",
                 "appears", "seems", "likely", "unlikely", "not significant"}

def qualifier_preservation_reward(completions, prompts, **kwargs):
    """Reward for preserving hedging language from source text."""
    rewards = []
    for completion, prompt in zip(completions, prompts):
        content = completion[0]["content"] if isinstance(completion, list) else completion
        prompt_text = prompt[0]["content"] if isinstance(prompt, list) else prompt
        
        # Find hedging words in source
        source_hedges = {w for w in HEDGING_WORDS if w in prompt_text.lower()}
        if not source_hedges:
            rewards.append(0.5)  # Neutral if no hedging in source
            continue
        
        # Check if hedging is preserved in extraction
        try:
            data = json.loads(content)
            claims = data if isinstance(data, list) else data.get("claims", [data])
            claim_text = " ".join(c.get("text", "") for c in claims if isinstance(c, dict)).lower()
            
            preserved = sum(1 for h in source_hedges if h in claim_text)
            rewards.append(preserved / len(source_hedges))
        except (json.JSONDecodeError, TypeError, AttributeError):
            rewards.append(0.0)
    return rewards

# ── GRPO Training ──
trainer = GRPOTrainer(
    model="./research-os-dpo",  # From stage 2
    reward_funcs=[
        json_validity_reward,        # Weight: 0.3
        schema_compliance_reward,    # Weight: 0.4
        qualifier_preservation_reward, # Weight: 0.3
    ],
    args=GRPOConfig(
        output_dir="./research-os-grpo",
        learning_rate=1e-6,
        num_generations=8,
        max_completion_length=2048,
        bf16=True,
        gradient_checkpointing=True,
        logging_steps=10,
        push_to_hub=True,
        hub_model_id="nkshirsa/phd-research-os-brain-v2",
        reward_weights=[0.3, 0.4, 0.3],
    ),
    train_dataset=prompt_dataset,  # "prompt" column with paper excerpts
    peft_config=LoraConfig(r=64, target_modules="all-linear"),
)
trainer.train()
```
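
One caveat with plain substring checks such as `w in prompt_text.lower()` is that short hedges also match inside longer words (`may` inside `dismay`, `likely` inside `unlikely`). A minimal sketch of a word-boundary variant follows; `find_hedges` is a hypothetical helper, not part of the training script.

```python
import re

HEDGING_WORDS = {"may", "might", "could", "suggests", "possibly", "potentially",
                 "appears", "seems", "likely", "unlikely", "not significant"}

# Longest alternatives first so multi-word hedges are tried before shorter ones.
_HEDGE_RE = re.compile(
    r"\b("
    + "|".join(re.escape(w) for w in sorted(HEDGING_WORDS, key=len, reverse=True))
    + r")\b",
    flags=re.IGNORECASE,
)


def find_hedges(text: str) -> set[str]:
    """Return the hedging terms that occur in `text` as whole words or phrases."""
    return {m.group(1).lower() for m in _HEDGE_RE.finditer(text)}


print(find_hedges("To our dismay, the effect is unlikely to persist."))
print(find_hedges("The difference was not significant but may matter."))
```

Swapping this in for the set comprehension would keep the reward's structure unchanged while avoiding credit (or blame) for accidental substring hits.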

### Stage 4: Calibration Fine-Tuning (ConfTuner)

After GRPO, apply ConfTuner with tokenized Brier score loss to fix confidence calibration. This is a specialized fine-tuning pass that targets only the confidence output tokens.
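
For reference, the Brier score is the mean squared gap between stated confidence and the 0/1 outcome. The sketch below computes the plain score over a handful of made-up claims; it is not ConfTuner's tokenized loss, only the underlying metric, and the data are hypothetical.

```python
def brier_score(confidences: list[float], outcomes: list[int]) -> float:
    """Mean squared error between predicted confidence and binary correctness."""
    if len(confidences) != len(outcomes) or not confidences:
        raise ValueError("need two non-empty lists of equal length")
    return sum((p - y) ** 2 for p, y in zip(confidences, outcomes)) / len(confidences)


# Hypothetical data: model confidence per claim, and whether the claim was correct.
confidences = [0.9, 0.8, 0.6, 0.3]
outcomes = [1, 1, 0, 0]

print(round(brier_score(confidences, outcomes), 4))
```

Lower is better: a perfectly calibrated and perfectly sharp model scores 0, while always answering 0.5 scores 0.25.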

---

## 4. Layer Specifications

### 4.0 Layer 0: Structural Ingestion Engine

**Purpose**: Convert PDF bundles into section-aware, bbox-annotated, quality-scored structured regions.

**Technology Stack**:

| Component | Tool | Purpose |
|-----------|------|---------|
| Layout detection | Marker (VikParuchuri/marker) | PDF → structured markdown with layout awareness |
| Math/equation | Nougat (facebookresearch/nougat) | Scientific PDFs → LaTeX equations |
| Bibliographic | GROBID | Headers, authors, citations, references |
| Region classifier | LayoutLMv3 or DocTR | Classify page regions: text, table, figure, equation |
| Plot digitizer | PlotDigitizer (algorithmic) | Quantitative plots → CSV of (x,y) coordinates |
| VLM for figures | Qwen3-VL-8B-Instruct Q4 AWQ | Semantic figure understanding |
| OCR quality | Per-span confidence scoring | Flag degraded regions |

**Output Schema** (per region):

```json
{
  "region_id": "REG_00042",
  "document_type": "main|supplement_1|supplement_2",
  "page": 5,
  "bbox": [72, 340, 540, 420],
  "region_type": "body_text|table|figure|equation|caption|header|reference|footnote",
  "section": "results",
  "subsection": "3.2_sensitivity_characterization",
  "content": {
    "text": "The LOD was 0.8 ± 0.03 fM (Table 2)",
    "markdown": "The LOD was 0.8 ± 0.03 fM ([Table 2](#table-2))",
    "parse_method": "marker",
    "parse_confidence": 0.95,
    "ocr_source": false
  },
  "cross_references": [
    {"ref_text": "Table 2", "ref_type": "table", "resolved_to": "REG_00038", "verified": true}
  ],
  "extraction_status": "extractable|low_confidence|unextractable",
  "quality_flags": [],
  "figures": {
    "detected": true,
    "figure_type": "scatter_plot|bar_chart|diagram|micrograph|schematic",
    "digitizable": true,
    "digitized_data": null
  }
}
```
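
Downstream layers benefit from rejecting malformed regions early. The following is a minimal validation sketch for the schema above; the function name `validate_region` and the specific checks are assumptions, chosen to mirror the enumerations shown in the example record.

```python
REGION_TYPES = {"body_text", "table", "figure", "equation",
                "caption", "header", "reference", "footnote"}
EXTRACTION_STATUSES = {"extractable", "low_confidence", "unextractable"}


def validate_region(region: dict) -> list[str]:
    """Return a list of problems; an empty list means the region passes."""
    problems = []
    if region.get("region_type") not in REGION_TYPES:
        problems.append(f"unknown region_type: {region.get('region_type')!r}")
    if region.get("extraction_status") not in EXTRACTION_STATUSES:
        problems.append(f"unknown extraction_status: {region.get('extraction_status')!r}")

    bbox = region.get("bbox")
    if not (isinstance(bbox, list) and len(bbox) == 4
            and all(isinstance(v, (int, float)) for v in bbox)):
        problems.append("bbox must be a list of four numbers")
    elif not (bbox[0] < bbox[2] and bbox[1] < bbox[3]):
        # Assumes [x0, y0, x1, y1] ordering, which the schema does not state explicitly.
        problems.append("bbox corners are not ordered as [x0, y0, x1, y1]")

    conf = (region.get("content") or {}).get("parse_confidence")
    if not (isinstance(conf, (int, float)) and 0.0 <= conf <= 1.0):
        problems.append("content.parse_confidence must be in [0, 1]")
    return problems


region = {
    "region_id": "REG_00042",
    "page": 5,
    "bbox": [72, 340, 540, 420],
    "region_type": "body_text",
    "content": {"text": "The LOD was 0.8 ± 0.03 fM (Table 2)", "parse_confidence": 0.95},
    "extraction_status": "extractable",
}
print(validate_region(region))
```

Returning a list of problems rather than raising lets the ingestion layer record every issue in `quality_flags` instead of stopping at the first one.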

**Chunking Strategy**: Section-aware, NOT page-based.
1. Marker identifies section boundaries (Introduction, Methods, Results subsections)
2. Chunk by section with 1-paragraph overlap to preceding and following sections
3. Tables always kept whole (never split across chunks)
4. Figure + caption always kept together
5. Maximum chunk size: 4096 tokens (model context allows it)
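
The overlap rule in step 2 can be sketched as follows. This is a simplified illustration that treats each section as a list of paragraph strings and ignores the table, figure, and token-limit rules; `chunk_by_section` is a hypothetical name.

```python
def chunk_by_section(sections: list[tuple[str, list[str]]]) -> list[dict]:
    """Build one chunk per section, padded with one paragraph of neighboring context.

    `sections` is an ordered list of (section_name, paragraphs) pairs.
    """
    chunks = []
    for i, (name, paragraphs) in enumerate(sections):
        # Last paragraph of the preceding section, first of the following one.
        before = sections[i - 1][1][-1:] if i > 0 else []
        after = sections[i + 1][1][:1] if i + 1 < len(sections) else []
        chunks.append({
            "section": name,
            "paragraphs": before + paragraphs + after,
            "overlap_before": len(before),
            "overlap_after": len(after),
        })
    return chunks


paper = [
    ("introduction", ["Intro p1", "Intro p2"]),
    ("methods", ["Methods p1"]),
    ("results", ["Results p1", "Results p2"]),
]
for chunk in chunk_by_section(paper):
    print(chunk["section"], chunk["paragraphs"])
```

Recording the overlap counts on each chunk makes it possible to avoid extracting the same claim twice from the shared paragraphs.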

**Paper Bundle Handling**:
```
Input: {
  "main_pdf": "path/to/paper.pdf",
  "supplements": ["path/to/supplement_1.pdf", "path/to/supplement_data.xlsx"],
  "code_repo": "https://github.com/author/repo",
  "dataset": "https://zenodo.org/record/12345"
}
```

### 4.1 Layer 1: Entity Resolution

**Purpose**: Normalize entities, resolve citations, check retractions, establish version lineage.

**Components**:

```
Entity Normalizer
  ├── Gene/protein names → UniProt ID
  ├── Chemical names → PubChem CID
  ├── Disease names → MeSH ID
  ├── Assay names → BAO ontology
  ├── Abbreviations → canonical form (LRU cache)
  └── Custom domain ontology (user-extensible)

Citation Chain Resolver
  ├── In-text "[32]" → reference list → DOI
  ├── DOI → CrossRef metadata
  ├── Check: is cited paper in knowledge base?
  ├── If yes: link claim to original source
  ├── If no: flag as "citation_orphan" for potential ingestion
  └── Classify: primary claim vs inherited citation

Version of Record (VoR) Lineage
  ├── Before ingestion: query DOI/arXiv for version chain
  ├── If preprint exists in DB and VoR arriving: supersede
  ├── If VoR exists and erratum arriving: amend specific claims
  ├── If retraction: invalidate ALL claims, propagate penalty
  └── Store full lineage: preprint_doi → vor_doi → errata → retraction

Retraction Checker
  ├── CrossRef "update-to" relationship
  ├── Retraction Watch database (periodic sync via companion model)
  └── Propagate retraction status through citation chains
```
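
The last step of the Retraction Checker, propagating retraction status through citation chains, amounts to a graph traversal. A minimal sketch, assuming an in-memory map from each paper to the papers that cite it (the `cited_by` structure and the DOI strings are hypothetical):

```python
from collections import deque


def propagate_retraction(retracted_doi: str, cited_by: dict[str, list[str]]) -> dict[str, int]:
    """Breadth-first walk from a retracted paper to everything that cites it.

    Returns each affected DOI mapped to its citation distance from the
    retracted paper (1 = cites it directly).
    """
    distance = {retracted_doi: 0}
    queue = deque([retracted_doi])
    while queue:
        doi = queue.popleft()
        for citing in cited_by.get(doi, []):
            if citing not in distance:  # Guards against citation cycles
                distance[citing] = distance[doi] + 1
                queue.append(citing)
    del distance[retracted_doi]
    return distance


cited_by = {
    "10.1000/retracted": ["10.1000/a", "10.1000/b"],
    "10.1000/a": ["10.1000/c"],
    "10.1000/b": ["10.1000/c"],
}
print(propagate_retraction("10.1000/retracted", cited_by))
```

The returned distance could be used to scale the penalty, for example a smaller deduction for papers two hops away; how the penalty decays is a design decision not specified here.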

### 4.2 Layer 2: Qualified Extraction

**Purpose**: Extract claims with full epistemic qualification using the AI Model Council.

**Council Architecture** (Parallel-Then-Merge):

```
Round 1 (PARALLEL: no visibility between members):
  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐
  │ Query Planner│  │  Extractor   │  │  Extractor 2 │  │   Critic     │
  │ (decompose)  │  │ (Qwen3-8B)   │  │ (if heterog.)│  │ (adversarial)│
  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘
         │                 │                 │                 │
         ▼                 ▼                 ▼                 ▼
  sub-queries         claims_A          claims_B          critique

Round 2 (DEBATE: see tags and reasoning, NOT confidence):
  All members see each other's epistemic tags and reasoning chains
  Each member can revise their classification
  Confidence scores remain HIDDEN (prevents anchoring)

Round 3 (SYNTHESIS: Chairman):
  Chairman sees everything including confidence
  Applies completeness penalty (code-enforced, not prompt-instructed)
  Resolves disagreements with documented reasoning
  Tags each claim with council_vote_distribution
```
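A minimal sketch of the Round 2 visibility rule and the `council_vote_distribution` tag, assuming each member's Round 1 output is a list of dicts with `text`, `tag`, `reasoning`, and `confidence` keys. The function names and that dict layout are hypothetical.

```python
from collections import Counter

# Fields members may see during the debate round; confidence is deliberately absent.
DEBATE_VISIBLE_FIELDS = ("text", "tag", "reasoning")


def debate_view(round1_outputs: dict) -> dict:
    """Strip everything except tags and reasoning before sharing between members."""
    return {
        member: [
            {k: v for k, v in claim.items() if k in DEBATE_VISIBLE_FIELDS}
            for claim in claims
        ]
        for member, claims in round1_outputs.items()
    }


def vote_distribution(tags_by_member: dict) -> dict:
    """Count how many members voted for each epistemic tag."""
    return dict(Counter(tags_by_member.values()))
```

Filtering by an allow-list rather than deleting `confidence` means any field added to the claim later stays hidden by default.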

**Epistemic Separation Engine**:

| Section | Epistemic Default | Confidence Modifier |
|---------|-------------------|-------------------|
| Results (with statistics) | Fact (if p < threshold) | 1.0 |
| Results (narrative) | Interpretation | 0.85 |
| Methods | Protocol metadata (not a claim) | N/A |
| Abstract | Interpretation (forced) | 0.7 (multiplier, applied as a penalty) |
| Discussion | Interpretation or Hypothesis | 0.75 (multiplier, applied as a penalty) |
| Conclusion | Cross-check against Results | 0.8 if supported, 0.5 if not |
| Supplement | Same as main body section rules | 1.0 (no penalty for supplement source) |
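The table can be read as a lookup of fixed-point multipliers (×1000, matching the scoring code later in this document). The sketch below is one possible encoding; the section keys and the `section_modifier` helper are assumptions, and a supplement claim is assumed to be looked up under the main-body section it mirrors.

```python
from typing import Optional

# Fixed-point multipliers (x1000) taken from the table above.
SECTION_MODIFIERS = {
    "results_statistical": 1000,
    "results_narrative": 850,
    "abstract": 700,
    "discussion": 750,
}


def section_modifier(section: str, supported_by_results: bool = False) -> Optional[int]:
    """Return the modifier for a section, or None when the text is not a claim."""
    if section == "methods":
        # Methods text is protocol metadata, so no confidence modifier applies.
        return None
    if section == "conclusion":
        # Conclusions are cross-checked against Results.
        return 800 if supported_by_results else 500
    return SECTION_MODIFIERS[section]
```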

**Constrained Decoding** (Guidance engine):

```python
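from guidance import models, gen, select

TAGS = ["Fact", "Interpretation", "Hypothesis", "Conflict_Hypothesis"]

# Regexes live outside the f-string: backslashes inside f-string
# expressions are a SyntaxError before Python 3.12.
SCORE_RE = r"0\.[0-9][0-9]?[0-9]?"
PAGE_RE = r"[0-9]+"

lm = models.Transformers("./research-os-grpo")  # Local model


def extract_claim(section_text: str, section_name: str):
    """Run one constrained extraction; captures are read as output["name"]."""
    # Each gen()/select() gets exactly one capture name. Free-text fields stop
    # at their closing delimiter so generation cannot run past it.
    return lm + f"""
    Analyze this scientific text and extract claims.
    
    Text: {section_text}
    Section: {section_name}
    
    <reasoning>{gen("reasoning", max_tokens=500, stop="</reasoning>")}</reasoning>
    
    Claims:
    [
      {{
        "text": "{gen("claim_text", max_tokens=200, stop='"')}",
        "epistemic_tag": "{select(TAGS, name="tag")}",
        "confidence_components": {{
          "evidence_strength": {gen("evidence", regex=SCORE_RE)},
          "qualifiers": ["{gen("qualifiers", max_tokens=100, stop='"')}"]
        }},
        "source_quote": "{gen("source_quote", max_tokens=200, stop='"')}",
        "source_page": {gen("page", regex=PAGE_RE)},
        "is_null_result": {select(["true", "false"], name="is_null")},
        "is_inherited_citation": {select(["true", "false"], name="is_inherited")}
      }}
    ]
    """

# output = extract_claim(section_text, section_name)
# output["tag"] is constrained to one of TAGS
# output["is_null"] is constrained to the string "true" or "false"
# output["evidence"] is a decimal string such as "0.9"; scale it by 1000
# before it enters the fixed-point scoring in Layer 5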
```

**Claim Schema v2** (expanded from v1):

```json
{
  "claim_id": "CLM_00042",
  "text": "The LOD was 0.8 fM in 10 mM PBS",
  "epistemic_tag": "Fact",
  "confidence": 0.855,
  "confidence_components": {
    "evidence_strength": 900,
    "study_quality_weight": 1000,
    "journal_tier_weight": 1000,
    "completeness_penalty": 1000,
    "section_modifier": 1000,
    "qualifier_penalty": 950
  },
  "qualifiers": ["in 10 mM PBS only", "n=5"],
  "missing_fields": [],
  "status": "Complete",
  "is_null_result": false,
  "is_inherited_citation": false,
  "causal_direction": "observed_correlation",
  "statistical_evidence": {
    "p_value": 0.001,
    "effect_size": 2.1,
    "effect_size_type": "cohens_d",
    "sample_size": 5,
    "confidence_interval": [0.6, 1.0],
    "practical_significance": true
  },
  "source_quote": "The limit of detection was determined to be 0.8 fM using the 3σ/slope method.",
  "source_page": 5,
  "source_bbox": [72, 340, 540, 365],
  "source_section": "results",
  "source_doi": "10.1234/example",
  "council_vote": {
    "extractor_1": {"tag": "Fact", "reasoning": "Direct measurement with statistics"},
    "extractor_2": {"tag": "Fact", "reasoning": "Quantitative with clear methodology"},
    "critic": {"tag": "Fact", "reasoning": "Supported by Table 2 data"},
    "chairman": {"tag": "Fact", "reasoning": "Unanimous agreement, strong statistics"}
  },
  "granularity": "atomic",
  "parent_claim_id": null,
  "sub_claims": [],
  "ontology_version": "quantum_bio_v1",
  "pipeline_version": "2.1.0",
  "taxonomy_version": "quantum_bio_v1",
  "extraction_timestamp": "2026-04-23T10:30:00Z"
}
```
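The example's `confidence` of 0.855 is consistent with multiplying the six fixed-point components (0.9 × 0.95 with the remaining weights at 1.0). The sketch below reproduces that arithmetic; treating `confidence` as this product is an inference from the example values, and `fp_product` is a hypothetical helper name.

```python
from functools import reduce

SCALE = 1000  # Fixed-point scale used by confidence_components


def fp_product(components: dict) -> int:
    """Multiply fixed-point weights, rescaling after every step to stay at x1000."""
    return reduce(lambda acc, weight: acc * weight // SCALE, components.values(), SCALE)


components = {
    "evidence_strength": 900,
    "study_quality_weight": 1000,
    "journal_tier_weight": 1000,
    "completeness_penalty": 1000,
    "section_modifier": 1000,
    "qualifier_penalty": 950,
}

confidence = fp_product(components) / SCALE  # 855 / 1000
```

Integer arithmetic keeps the result reproducible across machines, which floating-point multiplication chains do not guarantee bit-for-bit.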

### 4.3 Layer 3: Canonicalization

**Purpose**: Deduplicate claims, merge aliases, aggregate evidence, track temporal versions.

```
New claim arrives →
  1. Embed claim text (local embedding model or Qwen3-8B last-hidden-state)
  2. Search existing canonical claims (cosine similarity)
  3. If similarity > 0.85:
     ├── MERGE: Add new source as evidence for existing canonical claim
     ├── Update evidence_count, source_list, confidence (re-aggregate)
     ├── If confidence_components differ significantly: flag for human review
     └── Store alias mapping: new_claim_id → canonical_claim_id
  4. If similarity 0.70-0.85:
     ├── FLAG as "potential duplicate: review recommended"
     └── Show both claims in review queue with similarity score
  5. If similarity < 0.70:
     └── CREATE new canonical claim
```
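The threshold routing above can be sketched in plain Python. The embedding step is out of scope here, so vectors are passed in directly; `route_claim`, the returned action strings, and the in-memory `canonical` dict are illustrative assumptions rather than the registry's actual interface.

```python
import math

MERGE_THRESHOLD = 0.85   # above this: merge into the existing canonical claim
REVIEW_THRESHOLD = 0.70  # between the two thresholds: flag for human review


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def route_claim(new_vec, canonical: dict):
    """Return (action, canonical_id, similarity) for a newly embedded claim."""
    best_id, best_sim = None, -1.0
    for claim_id, vec in canonical.items():
        sim = cosine(new_vec, vec)
        if sim > best_sim:
            best_id, best_sim = claim_id, sim
    if best_id is not None and best_sim > MERGE_THRESHOLD:
        return ("merge", best_id, best_sim)
    if best_id is not None and best_sim >= REVIEW_THRESHOLD:
        return ("flag_for_review", best_id, best_sim)
    return ("create", None, best_sim)
```

A linear scan is enough for a sketch; at scale the same thresholds would sit behind a vector index such as the ChromaDB store mentioned in the VRAM budget.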

**Temporal Versioning**:
```
canonical_claim:
  version_history: [
    {version: 1, source: "preprint_2024", confidence: 0.65, date: "2024-03"},
    {version: 2, source: "vor_2024", confidence: 0.85, date: "2024-09"},
    {version: 3, source: "new_study_2025", confidence: 0.90, date: "2025-02"}
  ]
  current_version: 3
  supersedes: null
  superseded_by: null
```

### 4.4 Layer 4: Knowledge Graph

**Implementation**: SQLite-backed adjacency list (NOT Neo4j; this keeps the system local and zero-dependency).

**Schema**:

```sql
CREATE TABLE graph_nodes (
    node_id TEXT PRIMARY KEY,       -- canonical_claim_id or entity_id
    node_type TEXT NOT NULL,        -- claim | entity | method | condition
    label TEXT NOT NULL,
    properties TEXT,                -- JSON
    created_at TEXT NOT NULL
);

CREATE TABLE graph_edges (
    edge_id TEXT PRIMARY KEY,
    source_node TEXT NOT NULL,
    target_node TEXT NOT NULL,
    edge_type TEXT NOT NULL,        -- supports | refutes | extends | depends_on |
                                    -- supersedes | blocks | investigative_hypothesis |
                                    -- method_uses | condition_applies
    confidence INTEGER NOT NULL,    -- Fixed-point ×1000
    evidence_sources TEXT,          -- JSON array of source DOIs
    is_inferred INTEGER DEFAULT 0,  -- 0=observed, 1=inferred (transitive)
    inference_chain TEXT,           -- JSON: hop details if inferred
    method_compatible INTEGER,      -- NULL=unchecked, 0=incompatible, 1=compatible
    created_at TEXT NOT NULL,
    updated_at TEXT NOT NULL,
    FOREIGN KEY(source_node) REFERENCES graph_nodes(node_id),
    FOREIGN KEY(target_node) REFERENCES graph_nodes(node_id)
);

-- Index for fast graph traversal
CREATE INDEX idx_edges_source ON graph_edges(source_node);
CREATE INDEX idx_edges_target ON graph_edges(target_node);
CREATE INDEX idx_edges_type ON graph_edges(edge_type);
```

**Edge Types**:

| Type | Meaning | Confidence Rule |
|------|---------|----------------|
| `supports` | Claim A provides evidence for Claim B | From source text, observed |
| `refutes` | Claim A contradicts Claim B | From source text or conflict detection |
| `extends` | Claim A adds conditions/parameters to B | Section analysis |
| `depends_on` | Claim A assumes Claim B is true | Citation chain analysis |
| `supersedes` | Claim A replaces older Claim B (newer data) | Temporal versioning |
| `blocks` | Null finding: no evidence of relationship | Null result extraction |
| `investigative_hypothesis` | Inferred multi-hop (NOT observed) | min(hop_confidences) × 0.5 |

**Transitive Inference Constraints**:
- NEVER auto-generate `supports` across multiple hops
- Only `investigative_hypothesis` edges for multi-hop
- Require method_compatible=1 for each hop before generating inference
- Default queries return observed edges only
- `include_inferred=True` flag required for graph queries that include inferences
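
These constraints can be sketched against the `graph_edges` table using only the standard-library `sqlite3` module. The helper names `propose_inferred_edge` and `query_edges` and the hop dict layout are assumptions for illustration, not the contents of `transitive_constraints.py`.

```python
import sqlite3


def propose_inferred_edge(hops: list):
    """Build a multi-hop edge, or return None if the constraints forbid it.

    Each hop is a dict with 'confidence' (fixed-point x1000) and
    'method_compatible' (None, 0, or 1), mirroring the graph_edges columns.
    """
    if len(hops) < 2:
        return None  # Single hops are observed edges, not inferences.
    if any(hop["method_compatible"] != 1 for hop in hops):
        return None  # Every hop must be checked and compatible.
    return {
        "edge_type": "investigative_hypothesis",  # never `supports`
        "is_inferred": 1,
        "confidence": min(hop["confidence"] for hop in hops) * 500 // 1000,
    }


def query_edges(conn: sqlite3.Connection, node_id: str, include_inferred: bool = False):
    """Return outgoing edges; inferred edges are excluded unless requested."""
    sql = "SELECT edge_id, edge_type, confidence FROM graph_edges WHERE source_node = ?"
    if not include_inferred:
        sql += " AND is_inferred = 0"
    return conn.execute(sql + " ORDER BY edge_id", (node_id,)).fetchall()
```

For two compatible hops at 900 and 800, the inferred edge gets `min(900, 800) × 0.5 = 400`, so an inference can never outrank its weakest observed link.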

**Gap Analysis Protocol**:
```python
from itertools import combinations

def find_gaps(self, domain_id: str) -> list:
    """Find structural holes in the knowledge graph."""
    gaps = []

    # 1. Get all entities in domain, with their degrees
    entities = self.get_entities(domain_id)
    degrees = {e.id: self.get_degree(e.id) for e in entities}
    # Assumption: max_degree is the largest possible degree sum for a pair,
    # which keeps information_gain in [0, 1]
    max_degree = 2 * max(degrees.values(), default=0)

    # 2. For each entity pair in same domain
    for a, b in combinations(entities, 2):
        # 3. Skip pairs that already share an edge
        if self.get_edges(a.id, b.id):
            continue
        # 4. Check if both are well-connected (dense neighborhood)
        a_degree, b_degree = degrees[a.id], degrees[b.id]
        if a_degree > 3 and b_degree > 3:
            # 5. This is a high-value gap
            info_gain = (a_degree + b_degree) / max_degree
            gaps.append({
                "entity_a": a, "entity_b": b,
                "information_gain": info_gain,
                "suggested_action": "experiment" if info_gain > 0.7 else "literature_search"
            })

    return sorted(gaps, key=lambda g: -g["information_gain"])
```

### 4.5 Layer 5: Calibrated Scoring

**Purpose**: Compute confidence using CODE, not LLM. Three separate scores.

```python
def compute_claim_scores(claim: dict, source: dict, section: str, domain_id: str) -> dict:
    """
    Code-computed scoring. The LLM provides COMPONENTS, 
    the code computes the FINAL SCORES.
    
    The LLM NEVER sets the final confidence directly.
    """
    # ── Score 1: Evidence Quality ──
    evidence_strength = claim["confidence_components"]["evidence_strength"]  # From LLM
    study_quality = taxonomy.get_weight(source["study_type"], domain_id)     # From taxonomy
    journal_tier = JOURNAL_TIER_WEIGHTS[source["journal_tier"]]              # From config
    completeness = 700 if claim["missing_fields"] else 1000                  # Binary: code enforced
    section_mod = SECTION_MODIFIERS[section]                                  # From config
    
    # Fixed-point multiplication chain
    evidence_quality = (evidence_strength * study_quality // 1000 
                       * journal_tier // 1000 
                       * completeness // 1000
                       * section_mod // 1000)
    
    # ── Score 2: Claim Truth Likelihood ──
    # Based on evidence quality + source count + conflict status
    source_count_bonus = min(claim["evidence_count"] * 50, 200)  # Max +0.2 for multiple sources
    conflict_penalty = -300 if claim.get("has_active_conflict") else 0
    null_evidence_penalty = -200 if claim.get("has_null_evidence") else 0
    
    truth_likelihood = min(1000, max(0,
        evidence_quality + source_count_bonus + conflict_penalty + null_evidence_penalty
    ))
    
    # ── Score 3: Qualifier Strength ──
    # How definitive is the claim's language?
    qualifier_count = len(claim.get("qualifiers", []))
    is_null = claim.get("is_null_result", False)
    is_inherited = claim.get("is_inherited_citation", False)
    
    qualifier_strength = 1000
    if qualifier_count > 0:
        qualifier_strength -= qualifier_count * 100  # -0.1 per qualifier
    if is_null:
        qualifier_strength = min(qualifier_strength, 500)  # Cap at 0.5 for null results
    if is_inherited:
        qualifier_strength -= 200  # -0.2 for inherited citations
    qualifier_strength = max(0, qualifier_strength)
    
    # ── Statistical Evidence Gate ──
    stats = claim.get("statistical_evidence", {})
    if stats.get("effect_size") is not None:
        effect = stats["effect_size"]
        sample_n = stats.get("sample_size", 0)
        
        # Large N + tiny effect = statistically significant but practically meaningless
        if sample_n > 1000 and abs(effect) < 0.1:
            # Override: this is NOT practically significant
            evidence_quality = min(evidence_quality, 400)  # Cap at 0.4
            claim["practical_significance"] = False
    
    # ── Parser Confidence Propagation ──
    parse_conf = claim.get("parse_confidence", 1000)
    evidence_quality = min(evidence_quality, parse_conf)  # Parser uncertainty CAPS claim
    
    return {
        "evidence_quality": evidence_quality,            # Fixed-point ×1000
        "truth_likelihood": truth_likelihood,            # Fixed-point ×1000
        "qualifier_strength": qualifier_strength,        # Fixed-point ×1000
        "composite_confidence": (evidence_quality + truth_likelihood + qualifier_strength) // 3,
        "practical_significance": claim.get("practical_significance", True),
    }
```

### 4.6 Layer 6: Evaluation

**Evaluation Pipeline** (runs in CI/CD on every prompt/model/taxonomy change):

```
1. STRUCTURAL TESTS (existing 119 tests: code correctness)
   └── pytest tests/ → all pass?

2. GOLDEN DATASET REGRESSION (versioned annotations)
   ├── Extraction recall ≥ 70%
   ├── Hallucination rate ≤ 10%
   ├── Epistemic accuracy ≥ 60%
   ├── Qualifier preservation rate ≥ 80% (NEW)
   └── Null result detection rate ≥ 50% (NEW)

3. LLM-AS-JUDGE (faithfulness & grounding)
   ├── Faithfulness: does extracted claim appear in source text?
   ├── Grounding: can claim be traced to specific source quote?
   ├── Tag correctness: does epistemic tag match expert judgment?
   ├── Qualifier preservation: are hedging words maintained?
   └── Run on 5 golden papers, 3 times each (stochastic check)

4. CALIBRATION CHECK (monthly)
   ├── Brier score from calibration_log
   ├── Alert if ECE > 0.25
   └── Trigger ConfTuner re-training if needed

5. HIDDEN HOLDOUT (never seen during development)
   ├── 3 papers reserved, never used in training or golden set
   ├── Evaluated quarterly
   └── Detects benchmark overfitting
```
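For the calibration check in step 4, the two metrics can be computed with a few lines of standard Python. This is a sketch using the usual definitions of the Brier score and equal-width-bin expected calibration error (ECE); the input format (parallel lists of predicted confidences and 0/1 outcomes) and the ten-bin default are assumptions about what `calibration_log` would provide.

```python
def brier_score(probs, outcomes) -> float:
    """Mean squared difference between predicted confidence and the 0/1 outcome."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def expected_calibration_error(probs, outcomes, n_bins: int = 10) -> float:
    """Weighted average gap between mean confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, o))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(p for p, _ in members) / len(members)
        accuracy = sum(o for _, o in members) / len(members)
        ece += len(members) / len(probs) * abs(mean_conf - accuracy)
    return ece


ECE_ALERT_THRESHOLD = 0.25  # From the pipeline description above


def needs_recalibration(probs, outcomes) -> bool:
    """True when the ECE exceeds the alert threshold."""
    return expected_calibration_error(probs, outcomes) > ECE_ALERT_THRESHOLD
```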

**Versioned Annotation Guidelines**:
```
/evaluation/
├── guidelines_v1.0.md           # Annotation rules (version controlled)
├── golden_dataset/
│   ├── paper_001.json           # Annotated under guidelines v1.0
│   ├── paper_002.json           # Annotated under guidelines v1.0
│   └── paper_006.json           # Annotated under guidelines v1.1
├── frozen_anchors/              # NEVER re-annotated
│   ├── paper_001_frozen.json
│   └── paper_002_frozen.json
└── holdout/                     # NEVER seen during development
    ├── paper_H1.json
    └── paper_H2.json
```

### 4.7 Layer 7: Provenance & Reproducibility

**Output Lineage** (every claim tagged):
```json
{
  "pipeline_version": "2.1.0",
  "model_checkpoint": "research-os-grpo-v2-step-5000",
  "parser_version": "marker-1.2.0",
  "taxonomy_version": "quantum_bio_v1",
  "prompt_hash": "sha256:a3b4c5...",
  "extraction_timestamp": "2026-04-23T10:30:00Z",
  "guidance_schema_version": "1.0"
}
```
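One way to produce the `prompt_hash` and timestamp fields is with the standard library alone. In this sketch the `lineage_tag` helper and its arguments are hypothetical; only the `sha256:` prefix and the ISO-8601 UTC timestamp format are taken from the example above.

```python
import hashlib
from datetime import datetime, timezone


def prompt_hash(prompt_text: str) -> str:
    """Content hash of the exact prompt, so any prompt edit changes the lineage."""
    return "sha256:" + hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()


def lineage_tag(prompt_text: str, versions: dict, now=None) -> dict:
    """Combine static version fields with the prompt hash and a UTC timestamp."""
    now = now or datetime.now(timezone.utc)
    tag = dict(versions)  # copy so the caller's dict is not mutated
    tag["prompt_hash"] = prompt_hash(prompt_text)
    tag["extraction_timestamp"] = now.strftime("%Y-%m-%dT%H:%M:%SZ")
    return tag
```

Hashing the prompt text rather than storing a hand-maintained version number means two runs with the same tag are guaranteed to have used byte-identical prompts.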

**Security Sandbox** (for repository validation):
```
┌─── SANDBOX (isolated from main system) ──────────────────┐
│  • Timeout: 60 seconds max per URL check                 │
│  • Network: HTTP GET only, no POST/PUT/DELETE            │
│  • Download limit: 100MB per artifact                    │
│  • No code execution (dry-run validation only)           │
│  • Actual code execution requires human authorization    │
│  • Credential isolation: no access to main DB or API keys│
└──────────────────────────────────────────────────────────┘
```

**Epistemic Embargo** (for IP protection):
```
User creates "Private Graph" →
  All claims extracted in this mode go to private subgraph →
  Private subgraph is NOT visible to other users / companion agents →
  After paper submission: user clicks "Merge to Lab Graph" →
  Claims move from private to shared graph with full provenance
```
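The embargo workflow can be sketched as two small functions over claim records. The `graph`, `owner`, and `provenance` keys and both function names are assumptions for illustration; the actual `embargo_manager.py` may store visibility differently.

```python
def visible_to(claims: list, user: str) -> list:
    """Claims a user may see: everything shared, plus their own private claims."""
    return [c for c in claims if c["graph"] == "shared" or c["owner"] == user]


def merge_to_lab_graph(claims: list, owner: str) -> int:
    """Move the owner's private claims to the shared graph, recording provenance."""
    moved = 0
    for claim in claims:
        if claim["graph"] == "private" and claim["owner"] == owner:
            claim["graph"] = "shared"
            # Keep a trail so the merge itself is part of the claim's history.
            claim.setdefault("provenance", []).append(
                {"event": "merged_from_private", "by": owner}
            )
            moved += 1
    return moved
```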

---

## 5. UI Architecture

### 5.1 Courtroom UI (Conflict Resolution)

```
Default View (Review Queue):
  ⚠️ 3-way conflict detected: Debye screening threshold
  Papers: Chen 2022, Nakamura 2023, Williams 2024
  Comparability confidence: 0.58 (method differences detected)
  [Review] [Defer] [Dismiss]

Expanded View (Courtroom, click to open):
  ┌─────────────┬─────────────┬─────────────┐
  │ Chen 2022   │ Nakamura 23 │ Williams 24 │
  │ ACS Nano T1 │ Biosens. T1 │ Sensors T3  │
  ├─────────────┼─────────────┼─────────────┤
  │ Claim text  │ Claim text  │ Claim text  │
  │ (nestable)  │ (nestable)  │ (nestable)  │
  ├─────────────┼─────────────┼─────────────┤
  │ Method box  │ Method box  │ Method box  │
  │ N=5 p<.001  │ N=12 p<.01  │ N=3 p=.12   │
  │ [PDF📄]     │ [PDF📄]     │ [PDF📄]     │
  └─────────────┴─────────────┴─────────────┘
  
  System Analysis (Level 5, unverified):
  "These claims are not directly comparable..."
  Confidence in analysis: 0.62
  
  Council Votes: Ext1: scope_diff | Ext2: value_mismatch | Critic: scope_diff
  
  [Agree] [Override with custom] [Defer: need more info]
  
  ⚠️ Missing competitor evidence:
  "3 papers cited by these sources are not yet ingested"
  [Ingest Park 2023] [Ingest Liu 2024] [Ingest Fernandez 2023]
```

### 5.2 Progressive Disclosure Levels

```
Level 0: Dashboard
  Epistemic Health Score per claim cluster
  Today's review queue (priority-ranked)
  
Level 1: Claim Detail
  Text + tag + composite confidence + source
  [Expand to see scoring breakdown]
  
Level 2: Scoring Breakdown
  3 separate scores (evidence, truth, qualifier)
  Statistical evidence if available
  Parser confidence for this region
  
Level 3: Provenance Chain
  Source quote + page + bbox
  Council vote distribution
  Pipeline version + model checkpoint
  
Level 4: Graph Neighborhood
  2-hop subgraph around this claim
  Typed edges visible
  Inferred edges dashed + labeled
  
Level 5: Full Debug
  Raw LLM outputs from each council member
  Token-level confidence distribution
  Parse regions and quality flags
```

### 5.3 Manual Synthesis Mode

```
[Toggle] 🧠 Manual Synthesis Mode: ON

In this mode:
  ✅ Claims displayed (text + source)
  ✅ Organized by topic clusters
  ❌ NO confidence scores shown
  ❌ NO conflict flags shown  
  ❌ NO gap analysis shown
  ❌ NO system suggestions
  
  The researcher draws connections manually.
  Then switches back to compare with system's analysis.
```

---

## 6. Local Deployment

### 6.1 Minimal Setup (16GB VRAM)

```bash
# 1. Install Ollama (simplest local LLM server)
curl -fsSL https://ollama.com/install.sh | sh

# 2. Pull quantized model (after fine-tuning and uploading GGUF)
ollama pull nkshirsa/research-os-brain:q4_k_m

# 3. Verify it's running
curl http://localhost:11434/api/generate -d '{"model": "research-os-brain:q4_k_m", "prompt": "test"}'

# 4. Start the Research OS
pip install -r requirements.txt
python -m phd_research_os.serve --model ollama://research-os-brain:q4_k_m --port 8080

# 5. Open UI
# http://localhost:8080
```

### 6.2 VRAM Budget

```
Qwen3-8B Q4 AWQ weights:     ~5.0 GB
PolarQuant KV cache (128K):   ~3.8 GB
Qwen3-VL-8B Q4 (for figures): ~5.0 GB (loaded on-demand, not persistent)
Guidance engine overhead:      ~0.5 GB
ChromaDB embeddings:           ~0.5 GB
──────────────────────────────────────
Total (text only):             ~9.8 GB  ← fits 16GB GPU
Total (with VLM loaded):      ~14.8 GB  ← fits 16GB GPU (tight)
Total (with VLM on-demand):    ~9.8 GB  ← swap VLM in/out per figure
```
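The totals above can be checked with simple arithmetic. The figures are copied from the table; the dictionary keys and the `total_gb` helper are illustrative names only.

```python
# Persistent components in GB, as listed in the budget above.
BUDGET_GB = {
    "llm_weights_q4": 5.0,
    "kv_cache_128k": 3.8,
    "guidance_overhead": 0.5,
    "chromadb_embeddings": 0.5,
}
VLM_GB = 5.0   # Qwen3-VL-8B Q4, loaded on demand
GPU_GB = 16.0  # Target consumer GPU


def total_gb(with_vlm: bool) -> float:
    """Sum the budget, optionally including the vision model."""
    return round(sum(BUDGET_GB.values()) + (VLM_GB if with_vlm else 0.0), 1)


text_only = total_gb(with_vlm=False)               # 9.8
with_vlm = total_gb(with_vlm=True)                 # 14.8
headroom_with_vlm = round(GPU_GB - with_vlm, 1)    # 1.2
```

With the VLM resident only 1.2 GB remains for activations and fragmentation, which is why the table labels that configuration "tight" and offers on-demand swapping as the alternative.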

---

## 7. Data Flow (Complete Pipeline)

```
PDF Bundle arrives
  │
  ▼
LAYER 0: Structural Ingestion
  ├── Marker: layout-aware markdown with section boundaries
  ├── Nougat: equations → LaTeX (routed by region classifier)
  ├── GROBID: references → structured citations
  ├── Figure regions → classify → VLM (semantic) or Digitizer (quantitative)
  ├── Per-region quality scoring (parse_confidence, ocr_confidence)
  ├── Cross-reference verification (Figure 3 → correct figure object?)
  └── Output: list of annotated regions with bbox, section, quality
  │
  ▼
LAYER 1: Entity Resolution
  ├── Normalize entities (gene names, chemicals, assays → canonical IDs)
  ├── Resolve in-text citations ([32] → DOI → metadata)
  ├── Check VoR lineage (is this a preprint we already have?)
  ├── Check retraction status (CrossRef + Retraction Watch)
  └── Tag: primary vs inherited claims
  │
  ▼
LAYER 2: Qualified Extraction (AI Model Council)
  ├── Round 1 (parallel): Query Planner + 2 Extractors + Critic
  │   Each independently processes section-aware chunks
  │   Guidance engine enforces: valid JSON, valid tags, valid ranges
  │   Section modifier applied (Abstract=0.7, Results=1.0, Discussion=0.75)
  ├── Round 2 (debate): Share tags + reasoning (NOT confidence)
  ├── Round 3 (chairman): Synthesize final claims
  │   Apply completeness penalty (code-enforced: 0.7 if missing fields)
  │   Preserve qualifiers from source text
  │   Extract statistical evidence (N, p, d, CI)
  │   Tag null results, inherited citations, causal direction
  └── Output: list of qualified claims with full provenance
  │
  ▼
LAYER 3: Canonicalization
  ├── Embed each new claim
  ├── Compare against existing canonical claims (cosine > 0.85 = merge)
  ├── Merge: add source as evidence, update confidence aggregation
  ├── Create: new canonical claim with first source
  └── Temporal versioning: if same claim from VoR supersedes preprint version
  │
  ▼
LAYER 4: Knowledge Graph
  ├── Insert claim as graph node
  ├── Create edges from citation analysis (supports, depends_on)
  ├── Run conflict detector (keyword + embedding similarity for candidates)
  ├── Council evaluates candidate conflicts → typed edges (refutes, scope_diff)
  ├── Check for null evidence → blocking edges
  ├── Update method-compatibility metadata on edges
  ├── Cluster related conflicts into case files
  └── Run gap analysis (if in Research Landscape mode)
  │
  ▼
LAYER 5: Calibrated Scoring (CODE-COMPUTED)
  ├── evidence_quality = evidence × quality × tier × completeness × section
  ├── truth_likelihood = evidence_quality + source_bonus - conflict_penalty
  ├── qualifier_strength = 1.0 - qualifier_count×0.1 - null_penalty - inherited_penalty
  ├── Statistical evidence gate: large N + tiny effect → cap confidence
  ├── Parser confidence propagation: parse_confidence caps evidence_quality
  └── Store all 3 scores + composite on claim
  │
  ▼
LAYER 6: Evaluation (on config change)
  ├── Regression gate against golden dataset
  ├── LLM-as-Judge faithfulness + grounding check
  ├── Brier score monitoring (monthly)
  └── Hidden holdout benchmark (quarterly)
  │
  ▼
LAYER 7: Provenance
  ├── Tag claim with full pipeline version lineage
  ├── Store bbox + source quote for UI traceability
  └── Export: Obsidian vault, Courtroom UI, CSV, BibTeX
```

---

## 8. Implementation Phases (Aligned with PhD Timeline)

### Phase A: Foundation (Weeks 1-6) - MUST BE FIRST

| Week | Task | Deliverable |
|------|------|-------------|
| 1-2 | Integrate Marker for PDF → structured markdown | Section-aware regions with bbox |
| 3 | Add Nougat routing for equation-heavy regions | LaTeX preservation |
| 4 | Implement section-aware chunking (replace page-based) | Semantic chunks |
| 5 | Add quality scoring per-region | parse_confidence on every span |
| 6 | Integrate Guidance engine for constrained decoding | Guaranteed valid JSON output |

### Phase B: Identity (Weeks 7-12)

| Week | Task | Deliverable |
|------|------|-------------|
| 7-8 | Claim canonicalization with embedding dedup | Canonical registry |
| 9 | Entity normalization (abbreviations, synonyms) | Ontology mapper |
| 10-11 | Citation chain resolution ([32] → DOI) | Primary vs inherited tagging |
| 12 | VoR lineage detection | Preprint → VoR superseding |

### Phase C: Structure (Weeks 13-20)

| Week | Task | Deliverable |
|------|------|-------------|
| 13-14 | SQLite-backed knowledge graph with typed edges | Graph schema + CRUD |
| 15-16 | Qualifier preservation + null result handling | Blocking edges |
| 17-18 | Method-compatibility layer | Comparability confidence |
| 19-20 | Conflict clustering into case files | Case file UI |

### Phase D: Calibration (Weeks 21-26)

| Week | Task | Deliverable |
|------|------|-------------|
| 21-22 | Epistemic Separation Engine (section modifiers) | Section-aware scoring |
| 23-24 | Statistical evidence extraction (N, p, d, CI) | Practical significance gate |
| 25-26 | GRPO training with epistemic reward functions | Trained model v2 |

### Phase E: Judgment (Weeks 27-32)

| Week | Task | Deliverable |
|------|------|-------------|
| 27-28 | Courtroom UI with PDF.js bounding box viewer | Provenance display |
| 29-30 | Council parallel-then-merge architecture | Hidden confidence protocol |
| 31-32 | Conflict clustering + case file resolution | Batch conflict resolution |

### Phase F: Longevity (Ongoing, PhD Year 1+)

| Task | Trigger |
|------|---------|
| Versioned ontology with backward-compatible queries | 3rd taxonomy update |
| VoR lineage tracking | First preprint → VoR encounter |
| Ongoing Brier calibration monitoring | 50+ calibration data points |
| Gold-standard drift detection | 2nd annotation batch |
| Gap Analysis Protocol | 100+ papers ingested |
| Manual Synthesis Mode | Thesis writing phase |

---

## 9. File Structure (v2.0)

```
phd-research-os/
├── SYSTEM_DESIGN.md                    # THIS DOCUMENT
├── BLINDSPOT_AUDIT_COMPLETE.md         # 87-blindspot audit
│
├── phd_research_os/                    # Core Python package
│   ├── __init__.py
│   │
│   ├── layer0/                         # Structural Ingestion
│   │   ├── parser.py                   # Marker + Nougat + GROBID orchestrator
│   │   ├── region_classifier.py        # LayoutLMv3 region classification
│   │   ├── chunker.py                  # Section-aware chunking
│   │   ├── figure_router.py            # VLM vs Digitizer routing
│   │   ├── plot_digitizer.py           # Quantitative plot → CSV
│   │   ├── quality_scorer.py           # Per-span quality scoring
│   │   └── cross_ref_verifier.py       # Figure/Table reference integrity
│   │
│   ├── layer1/                         # Entity Resolution
│   │   ├── entity_normalizer.py        # Ontology-aware normalization
│   │   ├── citation_resolver.py        # In-text [32] → DOI
│   │   ├── vor_lineage.py              # Version of Record tracking
│   │   └── retraction_checker.py       # CrossRef + Retraction Watch
│   │
│   ├── layer2/                         # Qualified Extraction
│   │   ├── council.py                  # Parallel-then-merge council (upgraded)
│   │   ├── epistemic_separator.py      # Abstract vs Results scoring
│   │   ├── qualifier_extractor.py      # Hedging, negation, conditions
│   │   ├── statistical_extractor.py    # N, p, d, CI extraction
│   │   ├── constrained_decoder.py      # Guidance engine integration
│   │   └── ood_detector.py             # Mahalanobis distance OOD gating
│   │
│   ├── layer3/                         # Canonicalization
│   │   ├── deduplicator.py             # Embedding-based near-duplicate detection
│   │   ├── canonical_registry.py       # Canonical claim management
│   │   ├── alias_merger.py             # Alias mapping and merging
│   │   └── temporal_versioner.py       # Claim version history
│   │
│   ├── layer4/                         # Knowledge Graph
│   │   ├── graph.py                    # SQLite-backed graph with typed edges
│   │   ├── conflict_detector.py        # Pairwise conflict detection (upgraded)
│   │   ├── conflict_clusterer.py       # Case file generation
│   │   ├── method_compatibility.py     # Cross-paper method comparison
│   │   ├── gap_analyzer.py             # Structural hole detection
│   │   └── transitive_constraints.py   # Multi-hop inference safety
│   │
│   ├── layer5/                         # Calibrated Scoring
│   │   ├── scorer.py                   # Code-computed 3-score system
│   │   ├── statistical_gate.py         # Effect size / practical significance
│   │   ├── section_modifiers.py        # Abstract/Results/Discussion weights
│   │   └── calibration_monitor.py      # Brier score tracking
│   │
│   ├── layer6/                         # Evaluation
│   │   ├── regression_gate.py          # Golden dataset regression
│   │   ├── llm_judge.py                # Faithfulness/grounding evaluation
│   │   ├── stochastic_tester.py        # Run-N-times variance check
│   │   └── annotation_drift.py         # Gold-standard drift detection
│   │
│   ├── layer7/                         # Provenance
│   │   ├── lineage_tagger.py           # Pipeline version tagging
│   │   ├── security_sandbox.py         # Isolated URL/repo validation
│   │   ├── license_checker.py          # Usage rights verification
│   │   └── embargo_manager.py          # Private graph / merge workflow
│   │
│   ├── ui/                             # Gradio UI
│   │   ├── app.py                      # Main application
│   │   ├── courtroom.py                # Conflict resolution courtroom
│   │   ├── dashboard.py                # Epistemic health dashboard
│   │   ├── pdf_viewer.py               # PDF.js with bbox highlighting
│   │   ├── manual_synthesis.py         # AI-free exploration mode
│   │   └── export.py                   # CSV, BibTeX, JSON, Obsidian export
│   │
│   ├── core/                           # Shared infrastructure
│   │   ├── db.py                       # SQLite data layer (existing, extended)
│   │   ├── taxonomy.py                 # Quantum-Bio V2 (existing)
│   │   ├── agents.py                   # Brain interface (existing, upgraded)
│   │   ├── agent_os.py                 # ECC Harness (existing)
│   │   ├── meta_improver.py            # Meta-Improver (existing)
│   │   └── skills/                     # Superpowers (existing)
│   │
│   ├── training/                       # Model training
│   │   ├── train_sft.py                # Stage 1: SFT
│   │   ├── train_dpo.py                # Stage 2: DPO
│   │   ├── train_grpo.py               # Stage 3: GRPO with epistemic rewards
│   │   ├── train_calibration.py        # Stage 4: ConfTuner
│   │   ├── reward_functions.py         # JSON validity, schema, qualifier rewards
│   │   └── generate_dataset.py         # Synthetic + real data generation
│   │
│   └── config/                         # Version-controlled configuration
│       ├── prompts/                    # All system prompts (git-tracked)
│       ├── taxonomy/                   # Domain taxonomies
│       ├── scoring/                    # Weight tables, thresholds
│       └── evaluation/                 # Golden dataset + guidelines
│
├── tests/
│   ├── test_layer0.py                  # Structural ingestion tests
│   ├── test_layer1.py                  # Entity resolution tests
│   ├── test_layer2.py                  # Extraction tests
│   ├── test_layer3.py                  # Canonicalization tests
│   ├── test_layer4.py                  # Knowledge graph tests
│   ├── test_layer5.py                  # Scoring tests
│   ├── test_layer6.py                  # Evaluation tests
│   ├── test_layer7.py                  # Provenance tests
│   ├── test_db.py                      # Data layer (existing 22 tests)
│   ├── test_agent_os.py                # ECC harness (existing 21 tests)
│   ├── test_taxonomy.py                # Taxonomy (existing 27 tests)
│   ├── test_skills_and_meta.py         # Skills + meta (existing 30 tests)
│   └── test_council.py                 # Council (existing 19 tests)
│
└── docs/
    ├── ARCHITECTURE.md                 # Project map (existing)
    ├── AGENTS.md                       # Agent registry (existing)
    ├── USAGE.md                        # Daily workflow guide
    ├── ANNOTATION_GUIDELINES.md        # Versioned golden dataset rules
    └── DEPLOYMENT.md                   # Local setup guide
```

---

## 10. Success Criteria

The system is DONE when:

1. **A researcher can drop a PDF and get back epistemic-tagged claims with source bounding boxes in under 5 minutes**
2. **Two claims from different papers that say the same thing are automatically recognized as the same canonical claim**
3. **A null result creates a blocking edge, not a gap, in the knowledge graph**
4. **An Abstract claim that overstates the Results gets automatically penalized**
5. **The courtroom shows three conflicting papers side-by-side with method comparison, and the researcher can resolve the conflict in 2 clicks**
6. **The gap analyzer identifies untested entity pairs and generates Decision Objects**
7. **The system knows when it doesn't know: OOD papers, unextractable regions, and uncalibrated confidence all surface to the human**
8. **All of the above works on a 16GB consumer GPU with zero internet dependency for paper processing**

---

*This design addresses all 87 blindspots from the complete audit.*
*Implementation timeline: ~32 weeks pre-PhD + ongoing during PhD Year 1-3.*
*The hardest part is not building it. It's keeping it honest.*

---

## Appendix A: Future Architecture Directions

> **Status**: Research-Backed Design Proposals (Not Yet Implemented)
>
> The following sections describe architecture improvements validated by recent peer-reviewed research. Each addresses a specific bottleneck in the current v2.0 design. Implementation is targeted for Phase F (Longevity) or beyond.

---

### A.1 Multi-Graph Agentic Memory (MAGMA Architecture)

**Source**: Jiang et al., *MAGMA: A Multi-Graph based Agentic Memory Architecture for AI Agents*, arXiv:2601.03236

**Problem**: The current Layer 4 Knowledge Graph uses a single graph with typed edges (`supports`, `refutes`, `extends`, `depends_on`, `supersedes`, `blocks`, `investigative_hypothesis`). All relational information (semantic similarity, temporal ordering, causal inference, and entity references) is stored in one monolithic edge space. This entangles orthogonal dimensions of reasoning and limits interpretability. When a user asks "Why did the 2023 paper reach a different conclusion?", the system must traverse edges that mix temporal, causal, and semantic relationships without query-adaptive guidance.

**MAGMA's Solution**: Decouple memory representation into four orthogonal relation graphs over a shared node set:

| Graph | Edge Semantics | Use Case in Research OS |
|-------|---------------|------------------------|
| **Temporal Graph** `𝒢_temp` | Strictly ordered pairs `(n_i, n_j)` where `τ_i < τ_j` | Chronological claim evolution: preprint → VoR → erratum → retraction |
| **Causal Graph** `𝒢_causal` | Directed edges representing logical entailment | "Because method X was used, result Y follows" |
| **Semantic Graph** `𝒢_sem` | Undirected edges: `cos(v_i, v_j) > θ_sim` | Conceptually similar claims across different papers |
| **Entity Graph** `𝒢_ent` | Bipartite edges: events ↔ abstract entity nodes | Object permanence: "LOD" entity linked to all claims mentioning it |
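
To make the table concrete, here is a minimal sketch of how three of the four relation graphs could be derived over one shared node set. It is an illustration under stated assumptions, not the MAGMA implementation: the node fields (`t`, `vec`, `entities`), the function names, and the default threshold `theta_sim=0.8` are all hypothetical, and the causal graph is left out because it needs an inference step rather than a simple rule.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_relation_graphs(nodes, theta_sim=0.8):
    """Split relations over one shared node set into per-graph edge lists.

    `nodes` maps node_id -> {"t": timestamp, "vec": embedding, "entities": set}.
    All field names are hypothetical. Causal edges are omitted on purpose.
    """
    graphs = {"temporal": [], "semantic": [], "entity": []}
    for a, b in combinations(sorted(nodes), 2):
        na, nb = nodes[a], nodes[b]
        # Temporal graph: directed edge from the strictly earlier node to the later one.
        if na["t"] < nb["t"]:
            graphs["temporal"].append((a, b))
        elif nb["t"] < na["t"]:
            graphs["temporal"].append((b, a))
        # Semantic graph: undirected edge when cosine similarity exceeds the threshold.
        if cosine(na["vec"], nb["vec"]) > theta_sim:
            graphs["semantic"].append((a, b))
    # Entity graph: bipartite edges from each claim node to the entities it mentions.
    for node_id in sorted(nodes):
        for entity in sorted(nodes[node_id]["entities"]):
            graphs["entity"].append((node_id, entity))
    return graphs
```

This sketch compares every pair of nodes, which is quadratic in the number of claims; a real Layer 4 would presumably use an approximate nearest-neighbor index for the semantic edges.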

**Query-Adaptive Traversal**: Instead of static graph lookups, MAGMA formulates retrieval as policy-guided traversal. A Router `ℛ` decomposes the user query into structured control signals:

1. **Intent Classification** `T_q ∈ {Why, When, Entity, What}`: "Why" queries bias traversal toward `𝒢_causal`; "When" queries bias toward `𝒢_temp`
2. **Temporal Parsing** `[τ_s, τ_e]`: hard time-window filter applied before graph traversal
3. **Representation Extraction**: dense embedding `q→` for semantic anchor search + sparse keywords for lexical matching
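
The three control signals can be sketched with a rule-based stand-in for the Router. This is only an illustration: the keyword heuristics, the year regex, and the function name `route_query` are assumptions, and the dense query embedding is omitted because it requires an embedding model. The design proposes a learned classifier for this step, not keyword rules.

```python
import re


def route_query(query):
    """Rule-based stand-in for the Router: returns (intent, time_window, keywords)."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    # Intent classification: crude keyword heuristics, assumed for illustration only.
    if "why" in words or "because" in words:
        intent = "Why"
    elif "when" in words or "before" in words or "after" in words:
        intent = "When"
    elif words and words[0] in ("who", "which"):
        intent = "Entity"
    else:
        intent = "What"
    # Temporal parsing: four-digit years become a closed window [tau_s, tau_e].
    years = [int(y) for y in re.findall(r"\b(?:19|20)\d{2}\b", query)]
    window = (min(years), max(years)) if years else None
    # Sparse keywords for lexical matching: drop short words and bare numbers.
    keywords = [w for w in words if len(w) > 3 and not w.isdigit()]
    return intent, window, keywords
```

For the example question from the Problem statement, `route_query("Why did the 2023 paper reach a different conclusion?")` yields the `Why` intent and a single-year window, which is what would bias traversal toward the causal graph.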

**Anchor Identification**: Multi-signal fusion via Reciprocal Rank Fusion (RRF):
```
S_anchor = Top_K( Σ_{m ∈ {vec, key, time}} 1 / (k + r_m(n)) )
```
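
A minimal sketch of the fusion step follows. It assumes each signal (`vec`, `key`, `time`) has already produced a best-first list of node ids; the function name `rrf_anchors` and the smoothing constant `k=60` (a common default for RRF, not a value taken from this design) are assumptions.

```python
def rrf_anchors(rankings, top_k=3, k=60):
    """Fuse several ranked lists of node ids with Reciprocal Rank Fusion.

    `rankings` maps a signal name (e.g. "vec", "key", "time") to a list of
    node ids ordered best-first. Ranks are 1-based; a node that is absent
    from a list contributes nothing for that signal.
    """
    scores = {}
    for ranked_ids in rankings.values():
        for rank, node_id in enumerate(ranked_ids, start=1):
            scores[node_id] = scores.get(node_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score (descending), breaking ties by node id for determinism.
    ordered = sorted(scores, key=lambda n: (-scores[n], n))
    return ordered[:top_k]
```

Because RRF only uses ranks, the three signals do not need comparable score scales, which is convenient when mixing cosine similarities with keyword and time-based rankings.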

**Adaptive Beam Search**: From anchors, expand context using a dynamic transition score:
```
S(n_j | n_i, q) = exp( λ₁ · φ(type(e_ij), T_q)   [structural alignment]
                     + λ₂ · sim(n→_j, q→) )      [semantic affinity]
```
where `φ` rewards edge types matching the query intent (e.g., causal edges for "Why" queries).
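
The transition score and the expansion loop can be sketched as below. This is a simplified reading of the formula, not MAGMA's reference code: the alignment table `phi`, the data layout of `graph` and `sims`, and the defaults for beam width and depth are all assumptions made for illustration.

```python
import math


def transition_score(edge_type, intent, sim, phi, lam1=1.0, lam2=1.0):
    """S(n_j | n_i, q) = exp(lam1 * phi(edge_type, intent) + lam2 * sim)."""
    return math.exp(lam1 * phi.get((edge_type, intent), 0.0) + lam2 * sim)


def beam_search(graph, sims, anchors, intent, phi, beam_width=2, depth=2):
    """Expand context from anchor nodes, keeping the best `beam_width` nodes per hop.

    `graph` maps node -> list of (neighbor, edge_type); `sims` maps node ->
    similarity to the query embedding. Returns visited nodes in discovery order.
    """
    visited = list(anchors)
    frontier = list(anchors)
    for _ in range(depth):
        candidates = {}
        for node in frontier:
            for neighbor, edge_type in graph.get(node, []):
                if neighbor in visited:
                    continue
                score = transition_score(edge_type, intent, sims.get(neighbor, 0.0), phi)
                candidates[neighbor] = max(score, candidates.get(neighbor, 0.0))
        # Keep only the highest-scoring neighbors for the next hop.
        frontier = sorted(candidates, key=lambda n: (-candidates[n], n))[:beam_width]
        if not frontier:
            break
        visited.extend(frontier)
    return visited
```

With the same graph and anchors, switching `intent` from `"Why"` to `"When"` changes which neighbors are expanded first, which is the query-adaptive behavior the formula is meant to produce.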

**Why This Is a Clear Improvement for the Research OS**:
- The system already stores temporal, causal, and entity information, but crams it into a single `edge_type` column. MAGMA's separation makes each dimension independently queryable and interpretable.
- Long-horizon reasoning across hundreds of papers requires chronological traversal ("what did we believe in 2020 vs 2024?"), causal traversal ("what methods caused this result?"), and semantic traversal ("what else is like this?"). A single graph forces all three into one edge space.
- The policy-guided router aligns retrieval with the user's actual intent, rather than returning generic nearest-neighbor results.
- The MAGMA paper reports consistent gains over monolithic memory baselines on LoCoMo (9K-token avg. conversations) and LongMemEval (100K+ token contexts).

**Implementation Path**:
- Phase 1: Extend `graph_edges` schema to support `graph_id ∈ {semantic, temporal, causal, entity}` (SQLite migration)
- Phase 2: Implement Router `ℛ` as a lightweight classifier (can reuse Qwen3-8B with a classification head)
- Phase 3: Replace static `get_edges()` with policy-guided traversal engine
- Phase 4: Add adaptive `λ₁, λ₂` weights tuned on researcher query logs
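
As a rough sketch of the Phase 1 migration, the snippet below adds a `graph_id` column and a composite index using the standard-library `sqlite3` module. The column names `source_id` and `target_id`, the index name, and the choice to default legacy rows to `'semantic'` are assumptions; the real `graph_edges` schema lives in `db.py` and may differ.

```python
import sqlite3

GRAPH_IDS = ("semantic", "temporal", "causal", "entity")


def migrate_graph_edges(conn):
    """Add a graph_id column plus a composite index to an existing graph_edges table."""
    columns = {row[1] for row in conn.execute("PRAGMA table_info(graph_edges)")}
    if "graph_id" not in columns:
        # Existing rows default to 'semantic' here; a real migration would
        # instead map each legacy edge_type to its proper graph.
        conn.execute(
            "ALTER TABLE graph_edges ADD COLUMN graph_id TEXT NOT NULL DEFAULT 'semantic'"
        )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_edges_graph_src "
        "ON graph_edges (graph_id, source_id)"
    )
    conn.commit()


def edges_in_graph(conn, graph_id):
    """Return (source_id, target_id) pairs for one relation graph."""
    if graph_id not in GRAPH_IDS:
        raise ValueError(f"unknown graph_id: {graph_id}")
    rows = conn.execute(
        "SELECT source_id, target_id FROM graph_edges "
        "WHERE graph_id = ? ORDER BY source_id, target_id",
        (graph_id,),
    )
    return rows.fetchall()
```

The migration is written to be idempotent, so it can run on every startup without failing once the column exists.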

---

### A.2 Post-Transformer Model Architecture: The Linear-Scaling Era

**Sources**:
- Gu & Dao, *Mamba: Linear-Time Sequence Modeling with Selective State Spaces*, 2023
- Peng et al., *RWKV: Reinventing RNNs for the Transformer Era*, 2023
- Lieber et al., *Jamba: A Hybrid Transformer-Mamba Language Model*, 2024
- DeepSeek-AI, *DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model*, 2024
- Nazari et al., *The Curious Case of In-Training Compression of State Space Models (CompreSSM)*, arXiv:2510.02823

**Problem**: The current Research OS is built on decoder-only Transformers (Qwen2.5-3B → Qwen3-8B). For 128K-context paper ingestion, the Transformer faces three scaling walls that worsen as context length and the knowledge base grow:

| Wall | Transformer Behavior | Impact on Research OS |
|------|---------------------|----------------------|
| **Memory Wall** | KV cache grows linearly with sequence length: `2 × n_layers × n_kv_heads × d_head × seq_len` elements per batch item (multiply by bytes per element for the size in bytes) | At 128K context, KV cache alone consumes ~3.8GB. Processing 10 papers simultaneously exhausts 16GB VRAM before model weights are counted. |
| **Compute Wall** | Self-attention is `O(n²)` in sequence length. Doubling a paper's length quadruples attention compute. | Ingesting a 200K-token supplement (not uncommon in genomics) costs 4× the attention compute of a 100K-token paper, not 2×. |
| **Energy Wall** | Every new token requires attending to ALL previous tokens, even if 99% are irrelevant. | Long-term batch processing of paper libraries becomes prohibitively expensive on consumer hardware. |
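
The memory and compute walls reduce to two small formulas, sketched here as helper functions. The function names and the example configuration in the usage note are hypothetical; they are not the actual Qwen3-8B settings, and the result depends strongly on the number of KV heads and the cache precision.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per_elem=2, batch=1):
    """Size of the key/value cache: one K and one V tensor per layer per token."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_elem * batch


def attention_cost_ratio(len_a, len_b):
    """How much more self-attention compute length `len_a` needs than `len_b` (O(n^2))."""
    return (len_a / len_b) ** 2
```

For an illustrative configuration of 32 layers, 8 KV heads, head dimension 128, and a 16-bit cache, `kv_cache_bytes(32, 8, 128, 131072)` comes to 2^34 bytes (16 GiB). Plugging in the real model configuration and cache precision is the way to check the ~3.8GB figure quoted in the table.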

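**The Post-Transformer Landscape**: Four validated architecture families address these walls. SSMs, RWKV, and hybrid models replace the `O(n²)` attention bottleneck with `O(n)` or sub-quadratic scaling; Mixture-of-Experts instead reduces the compute spent per token: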

#### A.2.1 State Space Models (SSMs): Mamba Family

**Core Mechanism**: Instead of "looking back" at every previous token (attention), SSMs compress history into a hidden state vector `h(k+1) = A·h(k) + B·x(k)`. The state acts as a "speed-reader's memory": a compressed summary of everything seen so far.
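
The recurrence can be sketched in plain Python to show why memory stays constant. This is a toy linear SSM with fixed `A` and `B`; the readout `y = C·h` is a standard addition that is assumed here rather than taken from the text, and Mamba's input-dependent (selective) parameters are deliberately left out.

```python
def ssm_scan(A, B, C, inputs):
    """Run h(k+1) = A·h(k) + B·x(k) over scalar inputs and emit y = C·h(k+1).

    A is a d×d matrix (list of rows); B and C are length-d vectors. The only
    state carried between steps is `h`, so memory is O(d) for any input length.
    """
    d = len(B)
    h = [0.0] * d
    outputs = []
    for x in inputs:
        h = [sum(A[i][j] * h[j] for j in range(d)) + B[i] * x for i in range(d)]
        outputs.append(sum(C[i] * h[i] for i in range(d)))
    return outputs
```

With a one-dimensional state and `A = [[0.5]]`, each step halves the old state before adding the new input, so older tokens fade geometrically instead of being stored.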

**Why It Fits the Research OS**:
- **Up to 5× inference throughput** over Transformers of similar size on long sequences, as reported in the Mamba paper (still to be confirmed on this project's consumer GPU)
- **Constant memory** during inference: state dimension is fixed regardless of sequence length. No KV cache.
- **Genomic-scale sequences**: Mamba handles 1M+ token contexts (e.g., full genome sequences, large supplement bundles)
- **Energy efficiency**: State updates are matrix-vector products, not matrix-matrix attention operations

**CompreSSM Enhancement (arXiv:2510.02823)**: A principled in-training compression framework for SSMs. Using Hankel singular value (HSV) analysis from control theory, CompreSSM identifies which state dimensions carry meaningful signal and surgically truncates low-energy dimensions during training. Key insight: SSMs trained large then compressed during training retain task-critical structure that models trained directly at small dimension lose.

**Implication for Research OS Training**: If the system migrates to an SSM backbone (e.g., a Mamba-based encoder for paper ingestion), CompreSSM enables:
- Start with a large state dimension (e.g., 256) for fast convergence
- Apply balanced truncation at fixed intervals during the first 10% of training
- End with a compact model (e.g., 32-dimensional state) that matches or exceeds the large model's performance
- **Wall-clock speedup**: Empirically validated ~2-4× faster training for equivalent final accuracy

#### A.2.2 RWKV: Parallel Training, RNN Inference

**Core Mechanism**: Receptance-Weighted Key-Value combines Transformer-like parallel training with RNN-like constant-memory inference. Uses a time-mixing formulation that decays past information exponentially (like an EMA filter), so distant tokens contribute less without explicit attention computation.
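
The exponential-decay idea can be illustrated with a simplified, single-channel weighted average. This is not the full RWKV time-mixing block: it omits the receptance gate, the current-token bonus term, and numerical stabilization, and the function names are hypothetical. It only shows that a decayed average over all history can be maintained with two running numbers.

```python
import math


def decayed_mix(keys, values, decay):
    """Recurrent form: constant-size state (num, den), updated once per token."""
    num = den = 0.0
    outputs = []
    for k, v in zip(keys, values):
        weight = math.exp(k)
        # Older contributions shrink by exp(-decay) at every step.
        num = math.exp(-decay) * num + weight * v
        den = math.exp(-decay) * den + weight
        outputs.append(num / den)
    return outputs


def decayed_mix_direct(keys, values, decay):
    """Reference form: recompute the weighted average over the full history."""
    outputs = []
    for t in range(len(keys)):
        weights = [math.exp(keys[i] - decay * (t - i)) for i in range(t + 1)]
        outputs.append(sum(w * values[i] for i, w in enumerate(weights)) / sum(weights))
    return outputs
```

Both functions produce the same outputs, but the recurrent form does constant work per token while the direct form does work proportional to the history length, which is the inference-time difference described above.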

**Why It Fits**:
- **Constant memory during inference**: `O(1)` memory per layer, independent of conversation length
- **Fast inference**: per-token latency stays constant, so turn 1,000 generates as fast as turn 1 (unlike Transformers, which slow as the KV cache grows)
- **Good for interactive UI**: The Courtroom UI and Manual Synthesis Mode require responsive inference during long sessions

#### A.2.3 Hybrid Models: Jamba / Griffin

**Core Mechanism**: Interleave a few Transformer layers (for precise short-range "sharpness") with many SSM or recurrent layers (for cheap long-range memory). Jamba uses 1 Transformer layer per 7 Mamba layers. Griffin uses Gated Linear Recurrent layers with local attention.
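
The 1-in-8 interleaving can be written as a layer schedule. This is a sketch of the ratio only: where the attention layer sits inside each block of eight (`attention_offset`) is an assumption, and the function name is hypothetical.

```python
def hybrid_layer_schedule(n_layers, attention_every=8, attention_offset=0):
    """Return layer types with one attention layer per `attention_every` layers."""
    return [
        "attention" if i % attention_every == attention_offset else "mamba"
        for i in range(n_layers)
    ]
```

For a 16-layer stack this gives 2 attention layers and 14 Mamba layers, so only the 2 attention layers keep a KV cache.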

**Why It Fits**:
- **Best of both worlds**: Transformer layers handle precise claim-to-claim attention within a paragraph; SSM layers handle document-wide context compression
- **Production-grade**: Jamba is already deployed at scale; Griffin is the architecture behind Google DeepMind's RecurrentGemma models
- **Minimal migration cost**: Can reuse existing Transformer-trained weights for the attention layers while adding SSM layers

#### A.2.4 Mixture-of-Experts (MoE)

**Core Mechanism**: Sparse activation. A 30B-parameter model activates only 3B parameters per token. Each token is routed to 1-2 "expert" sub-networks. The remaining 27B parameters are dormant for that token.
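
Sparse activation comes down to a top-k router, sketched below with toy experts. The function names and the softmax-over-selected-experts gating are assumptions chosen for illustration; production MoE layers differ in details such as load balancing and shared experts.

```python
import math


def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax-normalize their gate weights."""
    chosen = sorted(range(len(router_logits)), key=lambda i: -router_logits[i])[:k]
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def moe_forward(x, experts, router_logits, k=2):
    """Combine the outputs of only the selected experts; the rest are never evaluated."""
    return sum(gate * experts[i](x) for i, gate in top_k_route(router_logits, k))
```

Only the chosen experts run for a given token, which is where the compute saving comes from; the unselected experts still occupy memory.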

**Why It Fits**:
- **Huge model, tiny compute**: Quality of a 14B+ dense model with the per-token compute of a 3B model (all 30B parameters must still fit in memory)
- **Already in design**: The SYSTEM_DESIGN.md already mentions Qwen3-30B-A3B MoE as a "stealth option"
- **Specialization potential**: Different experts could specialize per scientific domain (biochemistry, materials science, quantum computing), a natural fit for the domain taxonomy

---

### A.3 Why Migrate? The 128K-Context Reality

The Research OS targets 128K-token contexts (~100 pages of dense scientific text). At this scale, the Transformer quadratic bottleneck is not theoretical; it is the primary hardware constraint:

| Metric | Transformer (Qwen3-8B) | Mamba-2 (2.8B) | RWKV-6 (3B) | Jamba (8B) |
|--------|----------------------|----------------|-------------|------------|
| Context Scaling | `O(n²)` | `O(n)` | `O(n)` | `O(n)` hybrid |
| KV Cache at 128K | ~3.8 GB | **None** | **None** | ~0.5 GB |
| Throughput (128K → 128K) | 1.0× baseline | **5.2×** | 4.1× | 3.5× |
| Inference Memory Growth | Linear | **Constant** | **Constant** | Sub-linear |
| Training Stability | Mature | Good (CompreSSM helps) | Good | Good |

**Recommendation**: The Research OS should plan a **gradual migration** rather than a hard switch:

1. **Short-term (Phase D/E)**: Continue with Qwen3-8B Transformer. The ecosystem (AWQ quantization, vLLM serving, GRPO training in TRL) is mature. The design already targets this.

2. **Medium-term (Phase F)**: Integrate a **hybrid model** as an optional ingestion backbone. A Jamba-style architecture (e.g., fine-tuning a hybrid model on the existing SFT dataset) can be tested alongside the Transformer. The Council architecture (Layer 2) is model-agnostic: it calls an API, not a specific architecture.

3. **Long-term (Year 2+)**: If the knowledge base grows to 1,000+ papers and batch ingestion becomes the norm, migrate the ingestion pipeline to an **SSM backbone** (Mamba-2 or RWKV). The claim extraction and epistemic classification tasks map cleanly to SSM sequence modeling. CompreSSM-style in-training compression would reduce training costs for domain adaptation.

4. **Companion Brain**: The Meta-Improver and external scanning agents (which touch the internet) can continue using frontier Transformer APIs (Claude, GPT-4o). Only the **local Primary Brain** (which processes raw paper text and must handle 128K contexts) benefits from the architecture migration.

---

### A.4 Summary: What to Add to the Implementation Roadmap

| Phase | Addition | Rationale |
|-------|----------|-----------|
| **Phase C (Weeks 13-20)** | Extend graph schema to multi-graph (temporal, causal, semantic, entity) | MAGMA separation improves interpretability and query accuracy. SQLite can support this with `graph_id` column + composite indexes. |
| **Phase D (Weeks 21-26)** | Add policy-guided traversal prototype for Layer 4 | Router + adaptive beam search for "Why" and "When" queries. Lightweight: does not require new model training. |
| **Phase F (Ongoing)** | Evaluate hybrid model (Jamba-style) for Primary Brain | Test on a holdout paper set. Compare extraction recall, epistemic accuracy, and VRAM usage vs. Qwen3-8B baseline. |
| **Phase F (Ongoing)** | If hybrid evaluation succeeds, add SSM/MoE model options to deployment config | `ollama pull jamba-research-os` or equivalent. Keep Transformer as default for stability. |
| **Year 2+** | Explore CompreSSM for in-training compression if training custom SSM domain models | Only if the project graduates to training its own backbone rather than fine-tuning off-the-shelf models. |

---

*This appendix was added on 2026-04-23 based on peer-reviewed research. All claims are attributed to specific papers. The Research OS v2.0 design remains valid; these are forward-looking enhancements for Phase F and beyond.*


---

## Appendix B: Prior Art Integration (Lessons from 15 Similar Systems)

**Date Added**: 2026-04-23
**Status**: ACTIONABLE. Maps each external system to specific PhD Research OS layers
**Source**: Comprehensive prior art analysis of 15 published systems across 6 capability areas

> **For the full analysis, see [PRIOR_ART_ANALYSIS.md](PRIOR_ART_ANALYSIS.md) and [SYSTEM_INSPIRATIONS.md](SYSTEM_INSPIRATIONS.md).**

### B.1 Systems Analyzed

We searched research papers, open-source code, commercial products, and HuggingFace repositories to find every system that overlaps with PhD Research OS. Nobody has built the complete system we've designed, but every piece exists somewhere. Here's how the landscape maps to our architecture.

```
                    PhD Research OS vs. The World
                    
Layer 0 (Parse)     ← Nougat (Meta), GROBID, Marker - ADOPT directly
Layer 1 (Resolve)   ← Semantic Scholar API, CrossRef - ADOPT as data sources
Layer 2 (Extract)   ← PaperQA2's RCS technique - ADAPT for pre-extraction filtering
                    ← KGX3's language-game filters - ADAPT as epistemic trigger words
                    ← Paper Circle's Coverage Checker - ADAPT as Completeness Auditor
                    ← CritiCal's self-critique - ADAPT for Council workflow
Layer 3 (Dedup)     ← SPECTER2 (AllenAI) - ADOPT directly for embeddings
Layer 4 (Graph)     ← SciBERT-NLI - ADOPT as fast contradiction pre-filter
                    ← CLAIRE's investigation loop - ADAPT for deep conflict analysis
                    ← SciERC's relation taxonomy - ADOPT for structural edge types
Layer 5 (Score)     ← CLUE's uncertainty explanation - INSPIRE confidence explanations
                    ← NEW: Epistemic Velocity Tracking (inspired by CLAIRE + PaperQA2)
Layer 6 (Evaluate)  ← SciFact benchmark - ADOPT as evaluation standard
                    ← SciRIFF training data - ADOPT for model training
Layer 7 (Export)    ← ORKG's human contribution model - INSPIRE feedback loops
                    ← NEW: Epistemic Provenance Levels (inspired by Paper Circle + ORKG)
```

### B.2 Direct Adoptions: Tools to Plug In

| Tool | HuggingFace / GitHub | Target Layer | What It Fixes |
|------|---------------------|-------------|---------------|
| **SPECTER2** | [`allenai/specter2_base`](https://huggingface.co/allenai/specter2_base) | Layer 3 | Replaces word-overlap dedup with meaning-based dedup |
| **SciFact** | [`bigbio/scifact`](https://huggingface.co/datasets/bigbio/scifact) | Layer 6 | Gives us a standard benchmark for claim verification |
| **SciRIFF** | [`allenai/SciRIFF`](https://huggingface.co/datasets/allenai/SciRIFF) | Training | 137K expert examples → 72× our current data |
| **Nougat** | [`facebook/nougat-base`](https://huggingface.co/facebook/nougat-base) | Layer 0 | Fixes equation parsing (garbled β†’ proper LaTeX) |
| **SciBERT-NLI** | [`gsarti/scibert-nli`](https://huggingface.co/gsarti/scibert-nli) | Layer 4 | Fast contradiction pre-filter (check 500K pairs cheaply) |

### B.3 Adapted Techniques: Rebuild for Our Needs

| Technique | Source System | Our Adaptation | Target Layer |
|-----------|-------------|---------------|-------------|
| **RCS (Rerank + Contextual Summarize)** | PaperQA2 | Pre-Extraction Filter: score chunks for claim density before Council | Layer 2 (pre-processing) |
| **Deterministic language-game filters** | KGX3 | Epistemic Trigger Words: rule-based validator alongside AI classification | Layer 2 (validation) |
| **Coverage Checker** | Paper Circle | Completeness Auditor: verify nothing was silently omitted | Layer 2 (post-processing) |
| **Refuse to answer** | PaperQA2 | Low Confidence Quarantine: claims below 0.3 → separate queue | Layers 2, 4, 7 |
| **Dual evidence checking** | FactReview | Cross-Reference Verification: check against both paper and knowledge graph | Between Layers 2-4 |
| **Investigation loop** | CLAIRE | Conflict Investigation Protocol: deep analysis before flagging contradictions | Layer 4 |
| **Self-critique for calibration** | CritiCal | Council Self-Critique step: Extractor writes uncertainty BEFORE Critic reviews | Layer 2 |
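
The Low Confidence Quarantine row amounts to a threshold split, sketched here. The claim representation (a dict with a `confidence` key) and the function name are assumptions; only the 0.3 cutoff comes from the table.

```python
def partition_by_confidence(claims, threshold=0.3):
    """Split claims into (accepted, quarantined) using a strict lower bound."""
    accepted, quarantined = [], []
    for claim in claims:
        # "Below 0.3" is read strictly: a claim at exactly 0.3 is accepted.
        if claim["confidence"] < threshold:
            quarantined.append(claim)
        else:
            accepted.append(claim)
    return accepted, quarantined
```

Keeping quarantined claims in their own queue, rather than dropping them, leaves them available for human review.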

### B.4 New Features Inspired by Prior Art

| Feature | Inspired By | What It Does |
|---------|-----------|-------------|
| **Epistemic Velocity** | CLAIRE + PaperQA2 | Tracks how claim confidence changes over time (rising/falling/volatile) |
| **Devil's Advocate Mode** | CLAIRE + KGX3 | Automatically challenges high-confidence claims with counter-evidence |
| **Epistemic Provenance Levels** | ORKG + Paper Circle | Tracks human verification level (0=unreviewed → 4=peer-reviewed) |
| **Confidence Decomposition Display** | CLUE | Shows WHY a score is what it is, not just the number |
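
One possible reading of Epistemic Velocity is a classifier over a claim's confidence history, sketched below. Everything here is an assumption made for illustration: the function name, the `min_change` threshold, the rule that two direction reversals mean "volatile", and the extra "stable" label, which the table does not list.

```python
def classify_velocity(history, min_change=0.05):
    """Label a time-ordered list of confidence scores.

    Returns "rising", "falling", "volatile", or "stable".
    """
    # Keep only steps large enough to count as real movement.
    steps = [b - a for a, b in zip(history, history[1:]) if abs(b - a) >= min_change]
    # A direction reversal between consecutive significant steps counts as a flip.
    flips = sum(1 for a, b in zip(steps, steps[1:]) if (a > 0) != (b > 0))
    if flips >= 2:
        return "volatile"
    net = history[-1] - history[0] if history else 0.0
    if net >= min_change:
        return "rising"
    if net <= -min_change:
        return "falling"
    return "stable"
```

The thresholds would need tuning against real claim histories before the labels carry any meaning.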

### B.5 What Makes Us Unique (Confirmed by Analysis)

After analyzing all 15 systems, three capabilities exist in NO published open-source system:

1. **Claim-level epistemic labels**: KGX3 classifies whole papers (like rating a restaurant). We classify individual claims (like rating each dish). Nobody else does claim-level with a persistent KG.

2. **Code-computed calibrated confidence**: Every other system either asks the AI "how confident are you?" (PaperQA2) or gives binary labels (SciFact's SUPPORTS/REFUTES). Our 3-score formula computed by Python code, where the AI provides raw components but never touches the final number, is unique.

3. **The integrated local-first 7-layer pipeline**: PaperQA2 does retrieval + QA (no persistent KG). Paper Circle does KG construction (no epistemic labels). AgentSLR does systematic reviews (no KG at all). Nobody combines all 7 layers into one local-first privacy-preserving system.

### B.6 Implementation Phases for Prior Art Integration

These integrate into the existing Phase A-F timeline from Section 8:

| Phase | Prior Art Integration | Aligns With |
|-------|----------------------|-------------|
| A (Weeks 1-6) | DA-4: Nougat integration, DA-1: SPECTER2 for embeddings | Foundation |
| B (Weeks 7-12) | DA-2: SciFact benchmark, DA-3: SciRIFF training data | Identity |
| C (Weeks 13-20) | DA-5: SciBERT-NLI pre-filter, AD-6: Investigation Protocol | Structure |
| D (Weeks 21-26) | AD-1: Pre-Extraction Filter (RCS), AD-3: Epistemic Trigger Words | Calibration |
| E (Weeks 27-32) | AD-4: Completeness Auditor, NF-2: Devil's Advocate Mode | Judgment |
| F (Ongoing) | NF-1: Epistemic Velocity, NF-3: Provenance Levels | Longevity |

---

*Appendix B added 2026-04-23. Based on analysis of 15 published systems, 12 open-source tools, and 14+ HuggingFace resources. Full details in [PRIOR_ART_ANALYSIS.md](PRIOR_ART_ANALYSIS.md) and [SYSTEM_INSPIRATIONS.md](SYSTEM_INSPIRATIONS.md).*