hash-map committed
Commit 2830c66 · verified · 1 Parent(s): f3ef8d2

Update README.md

Files changed (1):
  1. README.md +24 -4

README.md CHANGED
@@ -50,12 +50,30 @@ It is **not** a general-purpose LLM and performs poorly on questions without app
 A simple **keyword-based lexical retrieval** system is provided to help select relevant context chunks:
 
 ```python
-from collections import Counter
 import re
+import json
+from collections import defaultdict, Counter
+
+CHUNKS_FILE = "/kaggle/input/got-dataset/contexts.json"  # list of {text, source, chunk_id}
 
 def tokenize(text):
     return re.findall(r"\b[a-zA-Z]{3,}\b", text.lower())
 
+contexts = []
+token_to_ctx = defaultdict(list)
+
+with open(CHUNKS_FILE, "r", encoding="utf-8") as f:
+    data = json.load(f)
+
+for idx, item in enumerate(data):
+    text = item["text"]
+    contexts.append(item)
+
+    for tok in tokenize(text):
+        token_to_ctx[tok].append(idx)
+
+print(f"Indexed {len(contexts)} chunks")
+
 def retrieve_2_contexts(question, token_to_ctx, contexts):
     q_tokens = tokenize(question)
     scores = Counter()
@@ -68,7 +86,8 @@ def retrieve_2_contexts(question, token_to_ctx, contexts):
     return " ".join([contexts[cid]["text"] for cid in top_ids])
 ```
 
-This is a basic sparse retrieval method (similar to TF-IDF without IDF). For better performance consider switching to dense retrieval (sentence-transformers, ColBERT, etc.) in production.
+This is a basic sparse retrieval method (similar to TF-IDF without IDF).
+You can build a FAISS index over these contexts for better retrieval.
 
 ## 🧑‍💻 How to Use
 
@@ -122,11 +141,12 @@ print(answer_question(context, "Who killed Joffrey Baratheon?"))
 - **Hallucinations:** Can still invent facts when context is ambiguous, incomplete or contradictory
 - **Toxicity & bias:** Inherits biases present in the base Gemma model + any biases in the GoT dataset (e.g. gender roles, violence portrayal typical of the series)
 - **No safety tuning:** No built-in refusal or content filtering
+- **Hugging Face token required:** Accessing the google/gemma repo requires an HF access token
 
 **Recommendations:**
-- Always provide rich, accurate context
+- The provided contexts work well; you can try another retriever, but keep the total context under 200 tokens
 - Manually verify outputs for important use cases
-- Consider adding a guardrail / moderation step in applications
+- Consider adding a guardrail/moderation step in applications
 
 ## 📚 Citation
 
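For context: the hunks above elide the body of `retrieve_2_contexts` between `scores = Counter()` and the final `return`. A minimal end-to-end sketch of how the scoring loop could look is below; the loop body, the one-point-per-keyword-hit scoring, and the toy two-chunk index are assumptions for illustration, not the committed code.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    # Lowercased alphabetic tokens of 3+ letters, as in the README
    return re.findall(r"\b[a-zA-Z]{3,}\b", text.lower())

def retrieve_2_contexts(question, token_to_ctx, contexts):
    q_tokens = tokenize(question)
    scores = Counter()
    # Assumed scoring loop: one point per keyword occurrence in a chunk
    for tok in q_tokens:
        for cid in token_to_ctx.get(tok, []):
            scores[cid] += 1
    # Keep the two best-matching chunks and join their text
    top_ids = [cid for cid, _ in scores.most_common(2)]
    return " ".join([contexts[cid]["text"] for cid in top_ids])

# Toy index over two chunks, mirroring the indexing loop in the diff
contexts = [{"text": "Joffrey Baratheon was poisoned at his wedding."},
            {"text": "Winterfell is the seat of House Stark."}]
token_to_ctx = defaultdict(list)
for idx, item in enumerate(contexts):
    for tok in tokenize(item["text"]):
        token_to_ctx[tok].append(idx)

print(retrieve_2_contexts("Who poisoned Joffrey?", token_to_ctx, contexts))
```

Because ties and misses simply fall out of `Counter.most_common`, questions with no keyword overlap return an empty string, which is worth guarding against in an application.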
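The commit's note about building a FAISS index over the contexts amounts to dense retrieval: embed each chunk, then rank by cosine similarity to the embedded question. A minimal sketch of that idea follows, using NumPy brute-force search as a stand-in for FAISS; `build_index` and `search` are hypothetical names, the 3-d vectors are toy data, and in practice you would embed the chunks with a model such as sentence-transformers and use `faiss.IndexFlatIP` after `faiss.normalize_L2`.

```python
import numpy as np

def build_index(vectors):
    # Normalize rows so that inner product equals cosine similarity
    # (the same effect as faiss.normalize_L2 + IndexFlatIP)
    v = np.asarray(vectors, dtype=np.float32)
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def search(index, query, k=2):
    q = np.asarray(query, dtype=np.float32)
    q = q / np.linalg.norm(q)
    scores = index @ q                 # cosine similarity per chunk
    top = np.argsort(-scores)[:k]      # k best-matching chunk ids
    return top, scores[top]

# Toy 3-d "embeddings" for three chunks; real ones would come from
# an embedding model, not be hand-written like this
index = build_index([[1.0, 0.1, 0.0],
                     [0.0, 1.0, 0.1],
                     [0.9, 0.2, 0.1]])
ids, sims = search(index, [1.0, 0.0, 0.0], k=2)
```

The returned `ids` index into the same `contexts` list the sparse retriever uses, so the rest of the pipeline (joining chunk text, the <200-token budget) stays unchanged.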