ASTERIZER/LUNA-Training


ASTERIZER committed 18 days ago

Commit 17a6b92 (verified) · 1 Parent(s): 1111c5e

Upload Base/Datasets/rag_mcp_sft/sample_preview.json with huggingface_hub


Files changed (1)

Base/Datasets/rag_mcp_sft/sample_preview.json +264 -0

Base/Datasets/rag_mcp_sft/sample_preview.json ADDED

	@@ -0,0 +1,264 @@

+[
+  {
+    "instruction": "Explain 'Chunking strategy' in practical terms.",
+    "input": "Audience: a junior ML engineer\nFocus: definition and purpose\nStyle rule: Use plain English with no jargon overload.\nLength target: 220-320 words",
+    "output": "The concept 'Chunking strategy' can be understood like this: Chunking balances retrieval precision against context completeness by dividing documents into useful units. Large chunks preserve context but can dilute retrieval precision. Small chunks improve focus but may lose context. Overlap can preserve continuity across boundaries. Good practice is to align chunking with document structure when possible.",
+    "meta": {
+      "kind": "qna",
+      "topic": "RAG",
+      "sources": [
+        "rag_chunking"
+      ],
+      "token_count": 131
+    }
+  },
+  {
+    "instruction": "Compare 'Why MCP matters' and 'Ingestion stage'.",
+    "input": "Audience: a solutions engineer\nFocus: system design choices\nStyle rule: Focus on operational tradeoffs, not marketing language.\nLength target: 220-320 words",
+    "output": "'Why MCP matters' and 'Ingestion stage' solve different problems, even when they appear in the same AI stack. For 'Why MCP matters': MCP reduces integration cost by giving clients and servers a shared protocol for context exchange and actions. Developers can build once and integrate across multiple hosts. For 'Ingestion stage': The ingestion stage prepares authoritative documents so retrieval can find them efficiently later. Fresh content should trigger incremental updates rather than full rebuilds when possible. Use them together only when the product needs both grounded knowledge and structured external capabilities. A sensible rule is to publish precise capability schemas and to keep a documented ingestion contract.",
+    "meta": {
+      "kind": "comparison",
+      "topic": "MCP+RAG",
+      "sources": [
+        "mcp_value",
+        "rag_ingestion"
+      ],
+      "token_count": 193
+    }
+  },
+  {
+    "instruction": "Describe 'Tool execution' in practical terms.",
+    "input": "Audience: an enterprise architect\nFocus: definition and purpose\nStyle rule: Keep the answer practical and free of hype.",
+    "output": "The concept 'Tool execution' can be understood like this: After discovery, a client can call a tool with structured arguments and receive a typed result. Responses can contain structured content, not just plain text. Arguments should conform to the tool's JSON schema. Good practice is to return actionable, well-typed results.",
+    "meta": {
+      "kind": "qna",
+      "topic": "MCP",
+      "sources": [
+        "mcp_tool_execution"
+      ],
+      "token_count": 117
+    }
+  },
+  {
+    "instruction": "Explain 'Lifecycle and initialization' in practical terms.",
+    "input": "Audience: a product manager\nFocus: system design choices\nStyle rule: Explain the idea cleanly without drifting into unrelated topics.\nLength target: 120-180 words",
+    "output": "The concept 'Lifecycle and initialization' can be understood like this: MCP is stateful and begins with an initialization exchange that negotiates protocol version and capabilities. Initialization establishes compatibility between client and server. Capability negotiation tells each side which primitives are supported. Identity fields help with debugging and version tracking. Good practice is to emit readiness only after initialization succeeds, then persist negotiated capabilities in connection state.",
+    "meta": {
+      "kind": "qna",
+      "topic": "MCP",
+      "sources": [
+        "mcp_lifecycle"
+      ],
+      "token_count": 145
+    }
+  },
+  {
+    "instruction": "When should a team rely on 'Tool discovery' versus 'Security and governance'?",
+    "input": "Audience: a solutions engineer\nFocus: system design choices\nStyle rule: Use plain English with no jargon overload.\nLength target: 120-180 words",
+    "output": "'Tool discovery' and 'Security and governance' solve different problems, even when they appear in the same AI stack. For 'Tool discovery': Clients discover tools through list operations before attempting execution. Hosts can build a unified tool registry from multiple servers. For 'Security and governance': MCP integrations still need authentication, authorization, scope control, and careful data handling. Servers should expose only the minimum necessary capabilities. Most MCP deployments need both: discovery tells the host what a server offers, and governance limits what the host is allowed to call. A sensible rule is to use descriptive tool names and to separate local trusted tools from remote untrusted surfaces.",
+    "meta": {
+      "kind": "comparison",
+      "topic": "MCP+MCP",
+      "sources": [
+        "mcp_tool_discovery",
+        "mcp_security"
+      ],
+      "token_count": 188
+    }
+  },
+  {
+    "instruction": "Create a clean explainer for 'Retrieval as an MCP server' for an engineering handbook.",
+    "input": "Audience: a solutions engineer\nFocus: definition and purpose\nStyle rule: Focus on operational tradeoffs, not marketing language.\nLength target: 120-180 words",
+    "output": "The topic 'Retrieval as an MCP server' matters because a retrieval engine can be packaged as an MCP server so hosts can discover search tools and evidence resources dynamically.\nKey points:\n- This keeps retrieval integration reusable across hosts.\n- Retrieved passages can be returned as typed content with metadata.\n- An MCP server can expose search tools with structured query schemas.\nWhy teams care:\n- It supports retrieval-backed agent tools.\n- It supports multi-host retrieval reuse.\n- It supports shared search backends.\nWhat to avoid:\n- Large unranked result sets inflate token cost.",
+    "meta": {
+      "kind": "description",
+      "topic": "Bridge",
+      "sources": [
+        "bridge_retrieval_server"
+      ],
+      "token_count": 201
+    }
+  },
+  {
+    "instruction": "How would you apply 'MCP plus RAG' for a real product team?",
+    "input": "Audience: a junior ML engineer\nFocus: deployment tradeoffs\nStyle rule: Explain the idea cleanly without drifting into unrelated topics.\nLength target: 220-320 words",
+    "output": "For a team building around retrieval-backed IDE helpers, start by treating 'MCP plus RAG' as an engineering system rather than a vague AI feature.\nMCP and RAG solve different layers of the stack: RAG grounds answers with retrieved evidence, while MCP standardizes how tools and context services are connected. MCP is an interoperability protocol. RAG is a retrieval-and-generation pattern.\nA practical rollout would:\n- Keep retrieval quality metrics separate from protocol compatibility metrics.\n- Expose retrieval services through clear MCP schemas when integrating agents.",
+    "meta": {
+      "kind": "scenario",
+      "topic": "Bridge",
+      "sources": [
+        "bridge_mcp_rag"
+      ],
+      "token_count": 189
+    }
+  },
+  {
+    "instruction": "Create an implementation checklist for 'Why MCP matters'.",
+    "input": "Audience: a product manager\nFocus: system design choices\nStyle rule: Make the answer useful for implementation planning.\nLength target: 4-6 bullet points",
+    "output": "Use this checklist when working with 'Why MCP matters':\n- Design for portability across hosts.\n- Publish precise capability schemas.\n- Treat auth and permissions as first-class concerns.\nWatch-outs:\n- Overexposing server capabilities can increase risk.\n- Lack of schema clarity makes tool use unreliable.",
+    "meta": {
+      "kind": "checklist",
+      "topic": "MCP",
+      "sources": [
+        "mcp_value"
+      ],
+      "token_count": 127
+    }
+  },
+  {
+    "instruction": "Help me understand 'Embeddings and vector search' in practical terms.",
+    "input": "Audience: a product manager\nFocus: how to explain it to a team\nStyle rule: Focus on operational tradeoffs, not marketing language.",
+    "output": "The concept 'Embeddings and vector search' can be understood like this: Embeddings turn text into vectors so semantically related passages can be retrieved even when wording differs. Vector search is often paired with metadata filtering. The embedding model should match the language and domain of the corpus. Good practice is to version embedding models and indexes together.",
+    "meta": {
+      "kind": "qna",
+      "topic": "RAG",
+      "sources": [
+        "rag_embeddings"
+      ],
+      "token_count": 128
+    }
+  },
+  {
+    "instruction": "Design a practical approach to 'RAG evaluation' for a real product team.",
+    "input": "Audience: a solutions engineer\nFocus: deployment tradeoffs\nStyle rule: Focus on operational tradeoffs, not marketing language.",
+    "output": "For a team building around release validation, start by treating 'RAG evaluation' as an engineering system rather than a vague AI feature.\nRAG quality should be measured with representative queries, known-good answers, and retrieval diagnostics. A fixed validation set helps track regressions over time. Ground truth is essential for knowing whether improvements are real. Evaluation should cover both retrieval quality and answer quality.\nA practical rollout would:\n- Track citation accuracy and insufficiency handling.\n- Review failures by retrieval stage and answer stage.",
+    "meta": {
+      "kind": "scenario",
+      "topic": "RAG",
+      "sources": [
+        "rag_evaluation"
+      ],
+      "token_count": 164
+    }
+  },
+  {
+    "instruction": "Teach me 'Definition of MCP' in practical terms.",
+    "input": "Audience: an enterprise architect\nFocus: deployment tradeoffs\nStyle rule: Focus on operational tradeoffs, not marketing language.",
+    "output": "The concept 'Definition of MCP' can be understood like this: The Model Context Protocol is an open standard that lets AI applications connect to external tools, resources, and workflows. MCP standardizes integration between AI applications and external systems. The design goal is interoperability instead of custom one-off connectors. It supports tools, data resources, and reusable prompts. Good practice is to use MCP when you need structured tool or context integration, then design tools for clarity and predictable behavior.",
+    "meta": {
+      "kind": "qna",
+      "topic": "MCP",
+      "sources": [
+        "mcp_definition"
+      ],
+      "token_count": 152
+    }
+  },
+  {
+    "instruction": "How should a real product team use 'Embeddings and vector search'?",
+    "input": "Audience: an enterprise architect\nFocus: deployment tradeoffs\nStyle rule: Keep the answer practical and free of hype.",
+    "output": "For a team building around similar policy lookup, start by treating 'Embeddings and vector search' as an engineering system rather than a vague AI feature.\nEmbeddings turn text into vectors so semantically related passages can be retrieved even when wording differs. Dense embeddings capture semantic similarity. The embedding model should match the language and domain of the corpus.\nA practical rollout would:\n- Version embedding models and indexes together.\n- Benchmark retrieval before production rollout.",
+    "meta": {
+      "kind": "scenario",
+      "topic": "RAG",
+      "sources": [
+        "rag_embeddings"
+      ],
+      "token_count": 155
+    }
+  },
+  {
+    "instruction": "What does good implementation of 'Security and governance' look like for a real product team?",
+    "input": "Audience: a backend engineer\nFocus: deployment tradeoffs\nStyle rule: Explain the idea cleanly without drifting into unrelated topics.\nLength target: 180-260 words",
+    "output": "For a team building around enterprise governance, start by treating 'Security and governance' as an engineering system rather than a vague AI feature.\nMCP integrations still need authentication, authorization, scope control, and careful data handling. Remote servers should use appropriate authentication mechanisms. Operational logs and tool arguments can contain sensitive data. Servers should expose only the minimum necessary capabilities.\nA practical rollout would:\n- Audit tool usage and sensitive accesses.\n- Separate local trusted tools from remote untrusted surfaces.",
+    "meta": {
+      "kind": "scenario",
+      "topic": "MCP",
+      "sources": [
+        "mcp_security"
+      ],
+      "token_count": 177
+    }
+  },
+  {
+    "instruction": "Write a description of 'Why base models need retrieval' for an engineering handbook.",
+    "input": "Audience: a product manager\nFocus: definition and purpose\nStyle rule: Explain the idea cleanly without drifting into unrelated topics.",
+    "output": "The topic 'Why base models need retrieval' matters because standalone language models are limited by knowledge cutoffs, incomplete domain depth, and lack of private data access.\nKey points:\n- Specialized domains often require evidence that was sparse or absent in pretraining.\n- A pretrained model cannot know events or documents added after training.\n- Private company knowledge should not be expected to appear in a public base model.\nWhy teams care:\n- It supports product documentation bots.\n- It supports policy assistants.\n- It supports current-events question answering.\nWhat to avoid:\n- Without citations, error diagnosis is harder.\n- Users may trust fluent but unsupported answers.",
+    "meta": {
+      "kind": "description",
+      "topic": "RAG",
+      "sources": [
+        "rag_limitations"
+      ],
+      "token_count": 203
+    }
+  },
+  {
+    "instruction": "Write a description of 'MCP plus RAG' for an engineering handbook.",
+    "input": "Audience: an enterprise architect\nFocus: definition and purpose\nStyle rule: Make the answer useful for implementation planning.\nLength target: 180-260 words",
+    "output": "The topic 'MCP plus RAG' matters because MCP and RAG solve different layers of the stack: RAG grounds answers with retrieved evidence, while MCP standardizes how tools and context services are connected.\nKey points:\n- MCP is an interoperability protocol.\n- RAG is a retrieval-and-generation pattern.\n- A retrieval service can be exposed through MCP as a tool or resource.\nWhy teams care:\n- It supports agentic research assistants.\n- It supports retrieval-backed IDE helpers.\nWhat to avoid:\n- Confusing protocol choice with answer quality leads to bad planning.",
+    "meta": {
+      "kind": "description",
+      "topic": "Bridge",
+      "sources": [
+        "bridge_mcp_rag"
+      ],
+      "token_count": 195
+    }
+  },
+  {
+    "instruction": "Help me understand 'Hosts, clients, and servers' in practical terms.",
+    "input": "Audience: an AI platform lead\nFocus: how to explain it to a team\nStyle rule: Keep the answer practical and free of hype.",
+    "output": "The concept 'Hosts, clients, and servers' can be understood like this: MCP follows a client-server model where a host application manages one client connection per server. The host is the AI application coordinating the experience. The server exposes context or actions to the client. The client manages the protocol connection to a server. Good practice is to store capability metadata per connection, then map host, client, and server roles explicitly in architecture docs.",
+    "meta": {
+      "kind": "qna",
+      "topic": "MCP",
+      "sources": [
+        "mcp_architecture"
+      ],
+      "token_count": 151
+    }
+  },
+  {
+    "instruction": "What is 'Notifications and live updates' in practical terms?",
+    "input": "Audience: an AI platform lead\nFocus: definition and purpose\nStyle rule: Explain the idea cleanly without drifting into unrelated topics.",
+    "output": "The concept 'Notifications and live updates' can be understood like this: Notifications let servers tell clients about changes, such as tool list updates, without waiting for a request. Notifications are one-way JSON-RPC messages with no response expected. They are useful when available capabilities change dynamically. Good practice is to tie notification handling to cache refresh logic.",
+    "meta": {
+      "kind": "qna",
+      "topic": "MCP",
+      "sources": [
+        "mcp_notifications"
+      ],
+      "token_count": 130
+    }
+  },
+  {
+    "instruction": "How should a real product team apply 'Why MCP matters'?",
+    "input": "Audience: a solutions engineer\nFocus: deployment tradeoffs\nStyle rule: Explain the idea cleanly without drifting into unrelated topics.\nLength target: 4-6 bullet points",
+    "output": "For a team building around cross-platform agent integrations, start by treating 'Why MCP matters' as an engineering system rather than a vague AI feature.\nMCP reduces integration cost by giving clients and servers a shared protocol for context exchange and actions. Developers can build once and integrate across multiple hosts. End users benefit from more capable assistants with fewer brittle integrations. AI applications gain structured access to data and actions.\nA practical rollout would:\n- Design for portability across hosts.\n- Treat auth and permissions as first-class concerns.",
+    "meta": {
+      "kind": "scenario",
+      "topic": "MCP",
+      "sources": [
+        "mcp_value"
+      ],
+      "token_count": 183
+    }
+  },
+  {
+    "instruction": "Write a description of 'Notifications and live updates' for an engineering handbook.",
+    "input": "Audience: a solutions engineer\nFocus: definition and purpose\nStyle rule: Explain the idea cleanly without drifting into unrelated topics.\nLength target: 180-260 words",
+    "output": "The topic 'Notifications and live updates' matters because notifications let servers tell clients about changes, such as tool list updates, without waiting for a request.\nKey points:\n- Notifications are one-way JSON-RPC messages with no response expected.\n- They are useful when available capabilities change dynamically.\n- Clients typically refresh their local registry after a change notification.\nWhy teams care:\n- It supports live server state updates.\n- It supports responsive agent interfaces.\n- It supports dynamic tool catalogs.\nWhat to avoid:\n- Leaving list-changed notification support undeclared in capabilities makes client behavior unpredictable.\n- Sending change events without rate control can create noise.",
+    "meta": {
+      "kind": "description",
+      "topic": "MCP",
+      "sources": [
+        "mcp_notifications"
+      ],
+      "token_count": 209
+    }
+  },
+  {
+    "instruction": "Teach me 'Tool discovery' in practical terms.",
+    "input": "Audience: a product manager\nFocus: security and governance\nStyle rule: Use plain English with no jargon overload.",
+    "output": "The concept 'Tool discovery' can be understood like this: Clients discover tools through list operations before attempting execution. Discovery avoids guessing what a server supports. Hosts can build a unified tool registry from multiple servers. Good practice is to use descriptive tool names.",
+    "meta": {
+      "kind": "qna",
+      "topic": "MCP",
+      "sources": [
+        "mcp_tool_discovery"
+      ],
+      "token_count": 105
+    }
+  }
+]
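
Every record in the preview above has the same shape (`instruction`, `input`, `output`, and a `meta` object with `kind`, `topic`, `sources`, `token_count`), and many inputs state a `Length target`. The following is a minimal sketch of a sanity check for such records, not part of the commit. The helper names (`parse_length_target`, `measure`, `check_record`) are hypothetical. Counting words by whitespace splitting and bullets by lines starting with `- ` are assumptions about how the targets were meant to be measured. The two sample records are copied from the preview; in practice the list would come from `json.load` on the file.

```python
import re

# Keys that every "meta" object in the preview carries.
REQUIRED_META = {"kind", "topic", "sources", "token_count"}


def parse_length_target(input_text):
    """Return (low, high, unit) from a 'Length target: A-B unit' line, or None."""
    match = re.search(r"Length target: (\d+)-(\d+) (words|bullet points)", input_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), match.group(3)


def measure(output_text, unit):
    """Count words (whitespace split) or bullets (lines starting with '- ')."""
    if unit == "words":
        return len(output_text.split())
    return sum(1 for line in output_text.splitlines() if line.startswith("- "))


def check_record(record):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = [
        f"missing field: {key}"
        for key in ("instruction", "input", "output", "meta")
        if key not in record
    ]
    if problems:
        return problems
    missing_meta = REQUIRED_META - set(record["meta"])
    if missing_meta:
        problems.append(f"missing meta keys: {sorted(missing_meta)}")
    target = parse_length_target(record["input"])
    if target is not None:
        low, high, unit = target
        count = measure(record["output"], unit)
        if not low <= count <= high:
            problems.append(f"length {count} {unit} outside target {low}-{high}")
    return problems


# Two records copied from the preview above.
records = [
    {
        "instruction": "Create an implementation checklist for 'Why MCP matters'.",
        "input": "Audience: a product manager\nFocus: system design choices\nStyle rule: Make the answer useful for implementation planning.\nLength target: 4-6 bullet points",
        "output": "Use this checklist when working with 'Why MCP matters':\n- Design for portability across hosts.\n- Publish precise capability schemas.\n- Treat auth and permissions as first-class concerns.\nWatch-outs:\n- Overexposing server capabilities can increase risk.\n- Lack of schema clarity makes tool use unreliable.",
        "meta": {"kind": "checklist", "topic": "MCP", "sources": ["mcp_value"], "token_count": 127},
    },
    {
        "instruction": "Explain 'Chunking strategy' in practical terms.",
        "input": "Audience: a junior ML engineer\nFocus: definition and purpose\nStyle rule: Use plain English with no jargon overload.\nLength target: 220-320 words",
        "output": "The concept 'Chunking strategy' can be understood like this: Chunking balances retrieval precision against context completeness by dividing documents into useful units. Large chunks preserve context but can dilute retrieval precision. Small chunks improve focus but may lose context. Overlap can preserve continuity across boundaries. Good practice is to align chunking with document structure when possible.",
        "meta": {"kind": "qna", "topic": "RAG", "sources": ["rag_chunking"], "token_count": 131},
    },
]

for record in records:
    print(record["instruction"], "->", check_record(record) or "ok")
```

Under these counting assumptions, the checklist record falls inside its 4-6 bullet target, while the chunking record's output is far shorter than its stated 220-320 word target. A check like this could flag such mismatches before the data is used for fine-tuning.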