hash-map committed
Commit 2830c66 · verified · 1 Parent(s): f3ef8d2

Update README.md

Files changed (1):
  1. README.md +24 -4

README.md CHANGED
@@ -50,12 +50,30 @@ It is **not** a general-purpose LLM and performs poorly on questions without app
 A simple **keyword-based lexical retrieval** system is provided to help select relevant context chunks:
 
 ```python
-from collections import Counter
 import re
+import json
+from collections import defaultdict, Counter
+
+CHUNKS_FILE = "/kaggle/input/got-dataset/contexts.json"  # list of {text, source, chunk_id}
 
 def tokenize(text):
     return re.findall(r"\b[a-zA-Z]{3,}\b", text.lower())
 
+contexts = []
+token_to_ctx = defaultdict(list)
+
+with open(CHUNKS_FILE, "r", encoding="utf-8") as f:
+    data = json.load(f)
+
+for idx, item in enumerate(data):
+    text = item["text"]
+    contexts.append(item)
+
+    for tok in tokenize(text):
+        token_to_ctx[tok].append(idx)
+
+print(f"Indexed {len(contexts)} chunks")
+
 def retrieve_2_contexts(question, token_to_ctx, contexts):
     q_tokens = tokenize(question)
     scores = Counter()
@@ -68,7 +86,8 @@ def retrieve_2_contexts(question, token_to_ctx, contexts):
     return " ".join([contexts[cid]["text"] for cid in top_ids])
 ```
 
-This is a basic sparse retrieval method (similar to TF-IDF without IDF). For better performance consider switching to dense retrieval (sentence-transformers, ColBERT, etc.) in production.
+This is a basic sparse retrieval method (similar to TF-IDF without IDF).
+You can build a FAISS index over these contexts for better retrieval.
 
 ## 🧑‍💻 How to Use
 
@@ -122,11 +141,12 @@ print(answer_question(context, "Who killed Joffrey Baratheon?"))
 - **Hallucinations:** Can still invent facts when context is ambiguous, incomplete or contradictory
 - **Toxicity & bias:** Inherits biases present in the base Gemma model + any biases in the GoT dataset (e.g. gender roles, violence portrayal typical of the series)
 - **No safety tuning:** No built-in refusal or content filtering
+- **Hugging Face token required:** Accessing the google/gemma repo requires an HF access token
 
 **Recommendations:**
-- Always provide rich, accurate context
+- The provided contexts work well; you can try another retriever, but keep the total context under 200 tokens
 - Manually verify outputs for important use cases
-- Consider adding a guardrail / moderation step in applications
+- Consider adding a guardrail/moderation step in applications
 
 ## 📚 Citation
 
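For context: the hunks above elide the body of `retrieve_2_contexts` between `scores = Counter()` and the final `return`. A minimal end-to-end sketch of how the scoring loop could look is below; the loop body, the one-point-per-keyword-hit scoring, and the toy two-chunk index are assumptions for illustration, not the committed code.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    # Lowercased alphabetic tokens of 3+ letters, as in the README
    return re.findall(r"\b[a-zA-Z]{3,}\b", text.lower())

def retrieve_2_contexts(question, token_to_ctx, contexts):
    q_tokens = tokenize(question)
    scores = Counter()
    # Assumed scoring loop: one point per keyword occurrence in a chunk
    for tok in q_tokens:
        for cid in token_to_ctx.get(tok, []):
            scores[cid] += 1
    # Keep the two best-matching chunks and join their text
    top_ids = [cid for cid, _ in scores.most_common(2)]
    return " ".join([contexts[cid]["text"] for cid in top_ids])

# Toy index over two chunks, mirroring the indexing loop in the diff
contexts = [{"text": "Joffrey Baratheon was poisoned at his wedding."},
            {"text": "Winterfell is the seat of House Stark."}]
token_to_ctx = defaultdict(list)
for idx, item in enumerate(contexts):
    for tok in tokenize(item["text"]):
        token_to_ctx[tok].append(idx)

print(retrieve_2_contexts("Who poisoned Joffrey?", token_to_ctx, contexts))
```

Because ties and misses simply fall out of `Counter.most_common`, questions with no keyword overlap return an empty string, which is worth guarding against in an application.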
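The commit's note about building a FAISS index over the contexts amounts to dense retrieval: embed each chunk, then rank by cosine similarity to the embedded question. A minimal sketch of that idea follows, using NumPy brute-force search as a stand-in for FAISS; `build_index` and `search` are hypothetical names, the 3-d vectors are toy data, and in practice you would embed the chunks with a model such as sentence-transformers and use `faiss.IndexFlatIP` after `faiss.normalize_L2`.

```python
import numpy as np

def build_index(vectors):
    # Normalize rows so that inner product equals cosine similarity
    # (the same effect as faiss.normalize_L2 + IndexFlatIP)
    v = np.asarray(vectors, dtype=np.float32)
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def search(index, query, k=2):
    q = np.asarray(query, dtype=np.float32)
    q = q / np.linalg.norm(q)
    scores = index @ q                 # cosine similarity per chunk
    top = np.argsort(-scores)[:k]      # k best-matching chunk ids
    return top, scores[top]

# Toy 3-d "embeddings" for three chunks; real ones would come from
# an embedding model, not be hand-written like this
index = build_index([[1.0, 0.1, 0.0],
                     [0.0, 1.0, 0.1],
                     [0.9, 0.2, 0.1]])
ids, sims = search(index, [1.0, 0.0, 0.0], k=2)
```

The returned `ids` index into the same `contexts` list the sparse retriever uses, so the rest of the pipeline (joining chunk text, the <200-token budget) stays unchanged.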