Update README.md
A simple **keyword-based lexical retrieval** system is provided to help select relevant context chunks:

```python
import re
import json
from collections import defaultdict, Counter

CHUNKS_FILE = "/kaggle/input/got-dataset/contexts.json"  # list of {text, source, chunk_id}

def tokenize(text):
    return re.findall(r"\b[a-zA-Z]{3,}\b", text.lower())

# Build an inverted index mapping each token to the chunks that contain it
contexts = []
token_to_ctx = defaultdict(list)

with open(CHUNKS_FILE, "r", encoding="utf-8") as f:
    data = json.load(f)

for idx, item in enumerate(data):
    text = item["text"]
    contexts.append(item)
    for tok in tokenize(text):
        token_to_ctx[tok].append(idx)

print(f"Indexed {len(contexts)} chunks")

def retrieve_2_contexts(question, token_to_ctx, contexts):
    q_tokens = tokenize(question)
    scores = Counter()
    # Score each chunk by the number of matching question tokens
    for tok in q_tokens:
        for cid in token_to_ctx.get(tok, []):
            scores[cid] += 1
    # Concatenate the two highest-scoring chunks
    top_ids = [cid for cid, _ in scores.most_common(2)]
    return " ".join([contexts[cid]["text"] for cid in top_ids])
```
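
For a quick smoke test, the same retriever can be run on a few inline toy chunks instead of the Kaggle `contexts.json` file (the chunk texts below are made up for illustration):

```python
import re
from collections import defaultdict, Counter

def tokenize(text):
    return re.findall(r"\b[a-zA-Z]{3,}\b", text.lower())

# Toy stand-ins for the contexts.json chunks (illustrative data only)
data = [
    {"text": "Jon Snow was named King in the North.", "source": "toy", "chunk_id": 0},
    {"text": "Olenna Tyrell poisoned Joffrey Baratheon at his wedding.", "source": "toy", "chunk_id": 1},
    {"text": "Daenerys Targaryen hatched three dragons.", "source": "toy", "chunk_id": 2},
]

contexts = []
token_to_ctx = defaultdict(list)
for idx, item in enumerate(data):
    contexts.append(item)
    for tok in tokenize(item["text"]):
        token_to_ctx[tok].append(idx)

def retrieve_2_contexts(question, token_to_ctx, contexts):
    scores = Counter()
    for tok in tokenize(question):
        for cid in token_to_ctx.get(tok, []):
            scores[cid] += 1
    top_ids = [cid for cid, _ in scores.most_common(2)]
    return " ".join(contexts[cid]["text"] for cid in top_ids)

# "poisoned" and "joffrey" both hit chunk 1, so it is returned first
print(retrieve_2_contexts("Who poisoned Joffrey?", token_to_ctx, contexts))
```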

This is a basic sparse retrieval method (similar to TF-IDF without IDF). For better retrieval, you can also build a FAISS index over these contexts and use dense (embedding-based) search.
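
One incremental improvement before reaching for FAISS is to add the missing IDF term, so rare tokens (such as character names) outweigh common ones. A minimal stdlib sketch, with illustrative data and helper names that are not part of this repo:

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"\b[a-zA-Z]{3,}\b", text.lower())

# Toy corpus standing in for the indexed chunks
docs = [
    "Jon Snow was named King in the North.",
    "Olenna Tyrell poisoned Joffrey at the Purple Wedding.",
    "The North remembers, and the King in the North rides south.",
]

# Document frequency: how many docs each token appears in
df = Counter()
for d in docs:
    for tok in set(tokenize(d)):
        df[tok] += 1

n_docs = len(docs)

def idf(tok):
    # Smoothed IDF; a token appearing in every document gets weight zero
    return math.log((1 + n_docs) / (1 + df[tok]))

def score(question, doc):
    # TF * IDF: term frequency in the doc, weighted by rarity in the corpus
    doc_toks = Counter(tokenize(doc))
    return sum(doc_toks[t] * idf(t) for t in tokenize(question))

best = max(range(n_docs), key=lambda i: score("Who poisoned Joffrey?", docs[i]))
print(best)
```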

## 🧑‍💻 How to Use

- **Hallucinations:** Can still invent facts when context is ambiguous, incomplete, or contradictory
- **Toxicity & bias:** Inherits biases present in the base Gemma model plus any biases in the GoT dataset (e.g. gender roles and the portrayal of violence typical of the series)
- **No safety tuning:** No built-in refusal or content filtering
- **Hugging Face token required:** Gemma is a gated model, so downloading it requires a Hugging Face access token with access granted to the Gemma repo
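
For example, one way to authenticate (assuming the `huggingface_hub` package; accept the Gemma license on the model page first):

```shell
pip install -U huggingface_hub
huggingface-cli login        # paste your access token when prompted
# or non-interactively via an environment variable:
export HF_TOKEN=hf_xxx       # placeholder; use your own token
```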

**Recommendations:**
- The provided contexts work well, but you can experiment with other retrievers; just make sure the total context length stays under 200 tokens
- Manually verify outputs for important use cases
- Consider adding a guardrail/moderation step in applications
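
A rough way to enforce that token budget, using whitespace word count as a cheap stand-in for the Gemma tokenizer's exact count (the helper below is illustrative, not part of the repo):

```python
# Approximate check that a retrieved context fits the <200-token budget.
# Word count only approximates the real count; for exact numbers, run the
# model's own tokenizer (e.g. transformers AutoTokenizer for Gemma).
def within_budget(context: str, max_tokens: int = 200) -> bool:
    return len(context.split()) <= max_tokens

print(within_budget("Jon Snow was named King in the North."))  # short contexts pass
```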

## 📚 Citation