diff --git "a/data/chunks/2603.10726_semantic.json" "b/data/chunks/2603.10726_semantic.json"
new file mode 100644
--- /dev/null
+++ "b/data/chunks/2603.10726_semantic.json"
@@ -0,0 +1,1182 @@
[
  {
    "chunk_id": "4c22d1e4-5615-46a7-9a0e-cd0473e4a341",
    "text": "CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems
Panagiotis Georgios Pennas, IMDEA Software Institute, Universidad Politécnica de Madrid, panagiotis.pennas@imdea.org
Konstantinos Papaioannou, IMDEA Software Institute, Universidad Politécnica de Madrid, konstantinos.papaioannou@imdea.org
Marco Guarnieri, IMDEA Software Institute, marco.guarnieri@imdea.org
Thaleia Dimitra Doudali, IMDEA Software Institute, thaleia.doudali@imdea.org",
    "paper_id": "2603.10726",
    "title": "CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems",
    "authors": [
      "Panagiotis Georgios Pennas",
      "Konstantinos Papaioannou",
      "Marco Guarnieri",
      "Thaleia Dimitra Doudali"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10726v1",
    "chunk_index": 0,
    "total_chunks": 59,
    "char_count": 467,
    "word_count": 45,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "92b87e19-b8af-42e1-b576-4c78c6136475",
    "text": "ABSTRACT
Large Language Models (LLMs) rely on optimizations like Automatic Prefix Caching (APC) to accelerate inference. APC works by reusing previously computed states for the beginning part of a request (prefix), when another request starts with the same text. 
While APC improves throughput, it introduces timing side channels: cache hits are faster than misses, creating observable latency differences. In multi-tenant systems, attackers can exploit these differences to infer sensitive information, e.g., by incrementally reconstructing another user's request by observing hit/miss patterns. Current defenses take a sledgehammer approach: they disable APC and cache sharing, isolating users, and sacrificing efficiency for regular users. This paper presents CacheSolidarity, a system that secures multi-tenant LLM serving systems against APC side channels without sacrificing performance and efficiency. CacheSolidarity monitors cache reuse across users, flags suspicious sharing, and selectively isolates prefixes, restricting their reuse, only when necessary. Evaluation shows that CacheSolidarity enables up to 70% higher cache reuse and 30% lower inference latency compared to existing defenses that isolate users. CacheSolidarity's lightweight design demonstrates how security in LLM serving does not have to come at the cost of unnecessarily reduced performance or unbearable overheads.
Figure 1: Timing side-channel leakage in prefix-sharing LLM inference. The attacker sends crafted prompts and measures time-to-first-token (TTFT) to detect cache hits or misses caused by Automatic Prefix Caching (APC) and steal the sensitive information in the victim's prompt.
While APC delivers substantial performance benefits, it also introduces security risks similar to other caching optimizations [12, 17, 30, 70]. APC creates observable variations in inference latency depending on whether a request reuses cached prefixes or not. These latency differences create a timing side channel, that is, an unintended information leak where execution time reveals properties of secret data without direct access. 
In multi-tenant deployments where the prefix cache is shared across different security domains (e.g., different users), attackers can exploit this side channel to infer private information. For example, by sending crafted requests and measuring time-to-first-token (TTFT), an attacker can determine whether parts of their input match cached prefixes from another user's request. Repeating this process enables prompt-stealing attacks that reconstruct sensitive content from other users [31, 58].
1 INTRODUCTION
Large Language Models (LLMs) now power applications such as conversational assistants, code generation, and enterprise analytics [13]. These services operate at massive scale under strict latency and throughput requirements, making inference serving a critical systems challenge [76]. Modern serving systems employ system-level optimizations such as cache management [39, 40, 59] and scheduling [2, 3, 33] to accelerate LLM inference.",
    "paper_id": "2603.10726",
    "title": "CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems",
    "authors": [
      "Panagiotis Georgios Pennas",
      "Konstantinos Papaioannou",
      "Marco Guarnieri",
      "Thaleia Dimitra Doudali"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10726v1",
    "chunk_index": 1,
    "total_chunks": 59,
    "char_count": 3345,
    "word_count": 452,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "c1d633b3-7767-4a1f-bc3a-52663e3bd73b",
    "text": "Among these optimizations, prefix sharing [10, 37, 71], otherwise known as Automatic Prefix Caching (APC), is widely used by state-of-the-art frameworks and commercial APIs, such as OpenAI [49], DeepSeek [19], Google Gemini [18], MoonShot Kimi [55], vLLM [39], and SGLang [74]. APC accelerates inference by caching and reusing previously computed model states for the beginning part of a request (the request's prefix) whenever another request starts with the same text. To prevent these attacks, prior work has proposed several defenses against APC timing leaks [14, 16, 31, 51, 75]. These approaches span three main strategies: (i) full user-level isolation, which disables prefix sharing entirely [31, 51, 75]; (ii) timing obfuscation, which injects noise to mask latency differences [14]; and (iii) selective isolation of secret-dependent prompts based on LLM-aided semantic analysis [16]. While effective in theory, all these defenses",
    "paper_id": "2603.10726",
    "title": "CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems",
    "authors": [
      "Panagiotis Georgios Pennas",
      "Konstantinos Papaioannou",
      "Marco Guarnieri",
      "Thaleia Dimitra Doudali"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10726v1",
    "chunk_index": 2,
    "total_chunks": 59,
    "char_count": 946,
    "word_count": 141,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "73554555-fe28-4228-88c3-1284225605a6",
    "text": "The use of a prefix cache avoids redundant computation and significantly reduces latency for long prompts and multi-turn conversations [37, 71]. However, these defenses suffer from a critical drawback: they introduce significant overhead, often unnecessarily, for example for prompts that are not involved in attacks or for situations where the timing side channel is not exploitable in practice. 
As a consequence, these solutions erase",
    "paper_id": "2603.10726",
    "title": "CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems",
    "authors": [
      "Panagiotis Georgios Pennas",
      "Konstantinos Papaioannou",
      "Marco Guarnieri",
      "Thaleia Dimitra Doudali"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10726v1",
    "chunk_index": 3,
    "total_chunks": 59,
    "char_count": 417,
    "word_count": 63,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "22f17842-e3e8-4531-9977-efaaac834eb7",
    "text": "APC's performance benefits and penalize regular, benign users; thereby limiting their applicability in realistic, high-performance multi-tenant environments. To address the shortcomings of current defenses, we introduce CacheSolidarity, a system-level solution that secures multi-tenant LLM serving against timing side channels caused by APC, while preserving its performance benefits. CacheSolidarity is based on the key insight that isolating entire users (as done in [31, 51, 75]) is not necessary to prevent attacks; rather it is sufficient to protect the shared prefixes that might lead to attacks. Building on this insight, CacheSolidarity (1) continuously monitors prefix reuse across users, (2) flags prefixes that are suspiciously reused by multiple users, and (3) prevents further reuse of flagged prefixes in follow-up requests. This approach embodies the principle of cooperative efficiency: benign users benefit from shared caching through solidarity, where common prefixes remain accessible to all. When suspicious behavior threatens this solidarity, CacheSolidarity intervenes by isolating only the flagged prefixes instead of penalizing all users. The system design of CacheSolidarity consists of lightweight components that work together to secure prefix caching without sacrificing efficiency. First, CacheSolidarity extends the prefix cache with minimal metadata for each entry to track ownership and flag
including observations on their exploitability under different workload and system conditions (Section 2).
• The design of CacheSolidarity, a lightweight and practical system that secures prefix reuse against timing attacks by selectively isolating suspicious prefixes rather than entire users, while preserving performance benefits (Section 3). CacheSolidarity will be open-sourced to allow community adoption.
• A comprehensive security analysis of CacheSolidarity's guarantees and limitations, complemented by empirical validation against prompt-stealing attacks (Section 4).
• An extensive evaluation of CacheSolidarity across diverse workloads and state-of-the-art LLM models, demonstrating up to 70% higher cache reuse and 30% lower latency compared to user-level isolation defenses, with negligible overhead (Section 5).
2 BACKGROUND AND MOTIVATION
In this section, we provide background and motivation for CacheSolidarity. We start by introducing background on LLM serving systems and, in particular, on the APC optimization (2.1). We then characterize the parameters that influence the exploitability of timing side channels due to APC (Section 2.2).",
    "paper_id": "2603.10726",
    "title": "CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems",
    "authors": [
      "Panagiotis Georgios Pennas",
      "Konstantinos Papaioannou",
      "Marco Guarnieri",
      "Thaleia Dimitra Doudali"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10726v1",
    "chunk_index": 4,
    "total_chunks": 59,
    "char_count": 2587,
    "word_count": 352,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "d20ef002-3a32-4705-b263-a0bdfe80e279",
    "text": "Then, we illustrate
suspicious reuse. 
Second, the Detector monitors cache hits and isolates flagged prefixes when multiple users attempt to share them. Finally, CacheSolidarity introduces a system-level optimization that activates prefix isolation only when a timing side channel is actually exploitable. This is handled by the Activator, which continuously evaluates whether latency differences between cache hits and misses are distinguishable, based on factors such as the hardware platform, LLM model size, system load, and request length. These parameters directly influence the strength and exploitability of the timing side channel, as demonstrated in our motivational analysis. Overall, CacheSolidarity's system components interact seamlessly: the cache extension provides metadata, the Detector enforces selective isolation, and the Activator optimizes the security-performance trade-off. CacheSolidarity is implemented on top of the open-source, state-of-the-art LLM serving system vLLM [64]. We conduct an extensive evaluation across multi-tenant workloads with varying levels of intra- and inter-user prefix sharing and nine LLM models spanning different families and sizes (from 0.5B to 13B parameters). Our experimental results show that CacheSolidarity achieves up to 70% higher cache reuse and 30% lower inference latency compared to defenses that enforce user-level isolation. In addition, CacheSolidarity introduces negligible time and memory overheads during runtime, demonstrating that security can be achieved without sacrificing performance when implemented in a lightweight and principled
how attackers can exploit APC through timing leaks to learn sensitive prompts (2.3). Next, we overview current defenses for APC timing leaks and discuss their limitations (2.4). We conclude by summarizing our motivational observations in (2.5).
2.1 Caching in LLM Serving Systems
Large Language Models (LLMs), such as GPT [13] and LLaMA [62], are the backbone of modern AI services. Integrating them in modern IT systems requires serving systems that deliver inference with low latency and high throughput under multi-user workloads. Inference happens in two stages: prefill and decode. During prefill, the entire input prompt (also called a request) is processed in a single forward pass, generating key-value (KV) tensors for each token. In the decode phase, tokens are produced autoregressively; for each new token, the model computes a new key-value tensor using all previously generated tensors. Without caching, every decode step would recompute all KV tensors for both prefill and prior decode tokens, incurring quadratic cost. To avoid this, current LLM serving systems employ a KV cache that stores tensors generated during the prefill and previous decoding steps for reuse. To further accelerate inference, several commercial frameworks such as OpenAI [49], DeepSeek [19], Google Gemini [18], MoonShot Kimi [55], vLLM [39], and SGLang [74] implement Automatic Prefix Caching (APC).",
    "paper_id": "2603.10726",
    "title": "CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems",
    "authors": [
      "Panagiotis Georgios Pennas",
      "Konstantinos Papaioannou",
      "Marco Guarnieri",
      "Thaleia Dimitra Doudali"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10726v1",
    "chunk_index": 5,
    "total_chunks": 59,
    "char_count": 3011,
    "word_count": 430,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "8d4d2070-4b9e-4f6b-a0bd-4b45f1c90a35",
    "text": "Finally, to validate the security of CacheSolidarity, we performed a detailed security analysis that precisely characterizes
dant computation by reusing cached KV tensors whenever a new request shares a prefix with an earlier one. 
This optimization is particularly effective for scenarios such as long-document queries or multi-round conversations [65], where repeated processing of the same prefix would otherwise incur significant overhead.
CacheSolidarity's security guarantees, which we complement with an empirical validation showing that CacheSolidarity indeed closes the timing side-channel introduced by APC across requests from different users.
The specific paper contributions are:
• A detailed analysis of timing side channels introduced by Automatic Prefix Caching (APC) in multi-tenant LLM serving systems,
When serving systems apply APC, a prompt can be viewed as a sequence of prefixes, each mapped to a cache entry in the KV/prefix cache. Requests may experience partial cache hits, starting from the beginning of the prompt and reusing some cached prefixes while recomputing others. The leftmost example in Figure 5 illustrates",
    "paper_id": "2603.10726",
    "title": "CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems",
    "authors": [
      "Panagiotis Georgios Pennas",
      "Konstantinos Papaioannou",
      "Marco Guarnieri",
      "Thaleia Dimitra Doudali"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10726v1",
    "chunk_index": 7,
    "total_chunks": 59,
    "char_count": 1148,
    "word_count": 166,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "053ea65b-920e-4d28-8b74-de54ebd89a87",
    "text": "this process, where each node corresponds to a cache entry for that particular prefix/part of the prompt sentence.
2.2 Timing Differences due to APC
Although the use of the prefix cache accelerates inference, it introduces observable latency differences between cache hits and misses, particularly in the time-to-first-token (TTFT). 
These variations create a timing side channel, an unintended information leak where execution time reveals whether a request shares prefixes with cached prompts. In multi-tenant environments, such leaks can be exploited by attackers to infer sensitive information. In the following examples, we illustrate how APC-induced timing differences manifest and we identify the key parameters that influence the strength and exploitability of the side channel. To illustrate how APC impacts latency, recall from Section 2.1 that a prompt is processed as a sequence of prefixes, each mapped to a cache entry. When APC is enabled, these prefixes may either hit in the cache or require recomputation. We consider two cases: one where all prefixes of the prompt hit the cache, and another where a miss occurs early, forcing all subsequent prefixes to be recomputed. This difference in reuse leads to observable variations in time-to-first-token (TTFT) between requests that fully reuse cached prefixes and those that recompute them. Figure 2 reports TTFT comparisons across four LLMs of increasing size and varying prompt lengths, with requests sent at a constant rate (RPS = 1). TTFT for cache hits is shown in red, while misses are shown in blue. For all models except the smallest one (i.e., LLava-0.5B), we highlight the following aspects:
• TTFT differences between cache hits and misses become noticeable after a certain prefix length for each model.
• The latency gap grows as the shared prefix length increases.
• For the same prefix length, larger LLMs exhibit greater TTFT differences.
• Larger LLMs show distinguishable timing differences even at shorter prefixes.
For LLava-0.5B, TTFT differences between hits and misses are negligible because the model is small and the recomputation latency for the misses is minimal.
Figure 2: TTFT difference between cache hits (red) and misses (blue) for increasing length of prefixes/prompts reused across users. Examples for different LLM models and system load (requests per second RPS). Panels: (a) LLava-0.5B, (b) Qwen2.5-3B, (c) Gemma3-4B, (d) Llama2-7B, (e) RPS = 10, (f) RPS = 20, (g) RPS = 30, (h) RPS = 40.
Observation 1: TTFT reveals whether a request reuses cached prefixes (hits) or recomputes them (misses), and the difference becomes more pronounced with larger models and longer prefixes that are reused across users.
Next, we examine how system load influences these timing differences using the same setup as before with the Llama-13B model. Figure 2 shows TTFT measurements under increasing request-per-second (RPS) rates (10, 20, 30, 40) for prompts with prefixes of 100–500 tokens. We observe that as the request-per-second (RPS) rate increases, the latency gap between cache hits and misses progressively collapses, making them hard to distinguish.
Observation 2: As system load increases, TTFT differences between prefix cache hits and misses disappear because batching and queuing delays dominate latency, masking cache-related timing variations.",
    "paper_id": "2603.10726",
    "title": "CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems",
    "authors": [
      "Panagiotis Georgios Pennas",
      "Konstantinos Papaioannou",
      "Marco Guarnieri",
      "Thaleia Dimitra Doudali"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10726v1",
    "chunk_index": 8,
    "total_chunks": 59,
    "char_count": 3747,
    "word_count": 602,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "bfd6bcc0-6602-4faf-894b-bae1b7423984",
    "text": "This happens because higher RPS introduces resource contention and head-of-line blocking in the LLM serving pipeline [2, 3, 52, 76]. Under these conditions, batching and queuing delays dominate latency, overshadowing the computational savings of prefix reuse. As a result, cache hits appear as slow as misses, effectively eliminating any timing differences. To further validate observations 1 and 2 across a wide range of conditions, Figure 3 shows the KDE overlap between the TTFT distributions of cache hits and misses across 5 different LLMs, increasing prefix/prompt length and system load (requests per second). KDE overlap is a statistical measure of similarity between two probability distributions, computed as the integral of the minimum of their density functions [34]. 
In our context, it quantifies",
    "paper_id": "2603.10726",
    "title": "CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems",
    "authors": [
      "Panagiotis Georgios Pennas",
      "Konstantinos Papaioannou",
      "Marco Guarnieri",
      "Thaleia Dimitra Doudali"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10726v1",
    "chunk_index": 9,
    "total_chunks": 59,
    "char_count": 795,
    "word_count": 120,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "850cc175-d383-4f4a-802b-bec290798da9",
    "text": "how much the TTFT distributions of cache hits and misses intersect: higher overlap indicates that hits and misses are harder to distinguish, while lower overlap means the opposite. The left plot shows that the overlap decreases as the prefix length grows, making timing differences more pronounced for longer prefixes, especially in larger models such as LLaMA-13B and LLaMA-7B.
Figure 3: Effect of the LLM model, prefix/prompt length and system load (requests per second) on the distinguishability of the APC timing differences, which is captured with the KDE overlap between cache hits and misses. Legend: Gemma-4B, LLama2-7B, Qwen2.5-3B, LLama2-13B, Llava-0.5B; y-axis: KDE Overlap (%); x-axes: Prompt Length (Tokens) and Request Rate (req/s).
reconstruct sensitive prompts or steal private information (secrets) of other users by observing the TTFT of attacker requests. That is, APC leaks can be exploited to perform prompt stealing attacks [58]. Prior studies [16] show that datasets capturing real-world LLM workloads, such as ShareGPT [9], Multiturn Chat [11], and Prompt Multitasks [24], reveal frequent prefix reuse within and across users. Specifically, requests from the same user share prefixes in 9–60% of cases, while up to 30% reuse occurs across different users. Thus, APC leaks are a real threat for the security of multi-tenant LLM serving systems. Figure 1 illustrates the workflow of a timing-based prompt stealing attack that exploits APC leaks, inspired by [58]. First, the victim user issues a request that includes a prompt containing sensitive information. For simplicity, here we assume that the prompt follows a fixed template (known by the attacker) and it includes user-specific secrets. For example, a prompt template can be Compose a meeting agenda for an interdisciplinary team discussing the treatment plan for [Name] with [medical condition]), where [Name] and [medical condition] are secret. The attacker uses an LLM-based Prompt Constructor to create candidate prompts by varying words within the template. Next, each candidate is submitted to the LLM serving system, and the time-to-first-token (TTFT) is measured. By observ-",
    "paper_id": "2603.10726",
    "title": "CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems",
    "authors": [
      "Panagiotis Georgios Pennas",
      "Konstantinos Papaioannou",
      "Marco Guarnieri",
      "Thaleia Dimitra Doudali"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10726v1",
    "chunk_index": 10,
    "total_chunks": 59,
    "char_count": 2161,
    "word_count": 331,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "a563720d-7b8a-4119-af5c-eed5e18a8c48",
    "text": "ing the TTFT, the attacker can determine (e.g., using a dedicated pre-trained classifier) whether the observed latency corresponds to hits or misses in the prefix cache. In particular, a cache hit indicates that the candidate shares a prefix with the victim's prompt. By repeating this process iteratively, the attacker can reconstruct the original prompt word-by-word or retrieve missing/private data fields (e.g., names, dates, identifiers) without direct access to the cache [31, 58]. In contrast, smaller models like Llava-0.5B maintain high overlap even for long prefixes, meaning that it's hard to distinguish any TTFT differences, as shown previously in Figure 2. The right plot shows that overlap increases with request rate (RPS), as batching and queuing delays dominate latency and mask cache-dependent variations. At high RPS, hits and misses become nearly indistinguishable across all models, effectively eliminating the timing leak. In conclusion, while prior work acknowledges APC-induced timing side channels [16, 31, 51, 75], our motivational analysis shows that the strength and exploitability of the channel depend on the following critical parameters:
• the length of the shared prefix: longer prefixes are more vulnerable.
• the size of the LLM model: larger models are more vulnerable.
2.4 Defenses Against APC leaks
Recent works have proposed several defenses against timing leaks induced by APC. These defenses fall into three main categories, each with distinct trade-offs between security and performance, as summarized in Table 1.",
    "paper_id": "2603.10726",
    "title": "CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems",
    "authors": [
      "Panagiotis Georgios Pennas",
      "Konstantinos Papaioannou",
      "Marco Guarnieri",
      "Thaleia Dimitra Doudali"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10726v1",
    "chunk_index": 11,
    "total_chunks": 59,
    "char_count": 1556,
    "word_count": 232,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "701ac02b-c9cb-4d82-868a-13f79b158927",
    "text": "Cache Isolation and Partitioning. 
On the one extreme, several works propose mitigating APC leaks by eliminating cache sharing entirely across different users and security domains. InputSnatch [75] and Auditing Prompt Caching [31] advocate per-user cache isolation or disabling prefix reuse across requests. Cache Partitioning [51] similarly enforces user boundaries within the KV cache to prevent cross-user hits. These approaches ensure strong security guarantees and effectively block timing leaks due to shared prefixes by addressing the leak at its source. However, they also sacrifice the performance benefits of prefix caching, which can result in significant overhead given that "real-world LLM workloads frequently exhibit significant cross-query reuse" [16].
Shortcoming 1: User-based cache isolation prevents timing leaks but introduces overhead due to reduced prefix sharing.
• the system load: lower load (requests received per second) is more vulnerable.
• the underlying hardware platform: since it directly influences timing measurements.
These parameters jointly determine whether timing side channels are distinguishable and should be taken into consideration when building robust and practical mitigation mechanisms.
Key insight: While APC introduces timing side channels, their exploitability depends on various parameters: the prefix length, LLM size, system load, and hardware platform. Prior work treated APC leaks as uniformly exploitable, whereas our analysis reveals that these parameters jointly determine side-channel strength and must guide efficient and practical mitigation.
2.3 Prompt Stealing Attacks via APC leaks
Timing Obfuscation. 
On the other extreme, Carlini and Nasr [14] propose to secure systems by making the execution time of inference "constant" across hits and misses. In multi-tenant systems the prefix cache is shared between requests from different users and security domains.",
    "paper_id": "2603.10726",
    "title": "CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems",
    "authors": [
      "Panagiotis Georgios Pennas",
      "Konstantinos Papaioannou",
      "Marco Guarnieri",
      "Thaleia Dimitra Doudali"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10726v1",
    "chunk_index": 12,
    "total_chunks": 59,
    "char_count": 1886,
    "word_count": 260,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "ba1aa0f9-2c3f-413e-b1ce-c9dc7c06a280",
    "text": "Hence, these visible timing differences can be exploited by attackers to incrementally
analysis, limiting scalability and leaving misclassified requests exposed. These shortcomings highlight a fundamental gap: we need a defense that adapts to the conditions under which timing leaks are exploitable, secures prefix reuse without blind per-user isolation, and preserves the efficiency of APC across diverse workloads, without introducing unbearable overheads in return for security. This motivates the design of a new system that achieves robust security with lightweight, practical mechanisms, enabling multi-tenant LLM serving to remain both performant and secure.
Table 1: Comparison of current mitigation strategies.
Related Work | Mitigation | Prefix Reuse | Low TTFT
Partitioning [51] | Per-user Isolation | ✗ | ✗
InputSnatch [75] | Per-user Isolation | ✗ | ✗
Audit [31] | Per-user Isolation | ✗ | ✗
Remote [14] | Timing Obfuscation | ✓ | ✗
SafeKV [16] | Selective Sharing | ✓ | ✗
CacheSolidarity | Selective Sharing | ✓ | ✓
3 SYSTEM DESIGN
In this section, we introduce CacheSolidarity, which provides system-level security against timing leaks induced by APC. We start by providing a high-level overview of CacheSolidarity and its objectives (3.1).
For this, they propose to inject noise in the system to mask timing signals and, ultimately, close the side channel by making hits and misses in the prefix cache indistinguishable to attackers. That is, these defenses would ensure that the distributions of hits and misses in Figure 2 overlap. Although obfuscation defenses can prevent APC leaks, they have two limitations. First, they introduce uniform delays across all requests,",
    "paper_id": "2603.10726",
    "title": "CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems",
    "authors": [
      "Panagiotis Georgios Pennas",
      "Konstantinos Papaioannou",
      "Marco Guarnieri",
      "Thaleia Dimitra Doudali"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10726v1",
    "chunk_index": 13,
    "total_chunks": 59,
    "char_count": 1680,
    "word_count": 247,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "3f7ef728-90ec-4561-adaf-c9cb68a2b447",
    "text": "penalizing benign users and reducing overall performance. Second, they do not address the source of the timing leak (i.e., the prefix cache being shared between security domains) and, historically, obfuscation defenses for timing leaks are less robust since attackers can often find ways of amplifying even small differences in execution time [29, 38, 63]. Next, we describe CacheSolidarity's core components (Sections 3.2 and 3.3).
3.1 Overview and Objectives
CacheSolidarity deploys a lightweight defense mechanism that secures multi-tenant LLM serving systems against timing side-channel attacks caused by Automatic Prefix Caching (APC). 
Shortcoming 2: Obfuscation-based techniques do not address the source of leaks and introduce overheads for benign requests.
Selective Cache Sharing. To balance security and efficiency, some defenses restrict cache sharing rather than disabling it completely. SafeKV [16] introduces a multi-tier detection pipeline that semantically classifies requests, using rule-based checks and semantic validation via LLMs, as either \"sensitive\" (i.e., possibly containing secret information) or non-sensitive. In SafeKV, only requests classified as sensitive are isolated at the cache level, thereby preventing other users (attackers included) from hitting on them.
CacheSolidarity builds on the key insight that preventing timing side-channel attacks does not require isolating entire users (as done in [31, 51, 75]); it only requires isolating prefixes that might lead to attacks. Building on this insight, CacheSolidarity allows cached prefixes to be reused across users but stops reuse beyond prefixes that could reveal sensitive information. This selective prefix isolation works on a simple principle: attacks require different users to share and reuse a prefix. CacheSolidarity enforces this by tracking how users interact with shared cache entries, without requiring any knowledge of prompt semantics [16].",
Although this approach prevents leakage without fully disabling APC, its reliance on heavyweight semantic LLM-based analysis limits scalability for high-performance deployments. Moreover, some requests might be misclassified and, therefore, left unprotected.
Shortcoming 3: Current selective sharing defenses may fail to prevent leaks for \"misclassified\" requests and still introduce unnecessary overhead.
This design embodies the principle of cooperative efficiency: benign users benefit from shared caching through solidarity, where common prefixes remain accessible to all. When suspicious behavior threatens this solidarity, CacheSolidarity intervenes by isolating the corresponding prefixes rather than penalizing all users. This ensures that cache reuse within and across users is maximized to the greatest extent possible without compromising security.
Objectives. The design of CacheSolidarity is guided by the following three objectives:
[O1] Secure Prefix Reuse: Prevent timing attacks introduced by Automatic Prefix Caching (APC) by isolating suspicious prefixes, avoiding heavy-handed approaches such as per-user isolation or disabling cache sharing.
[O2] Performance Preservation: Maximize cache reuse and accelerate inference across diverse workloads and LLM models, thereby ensuring that security does not come at the cost of performance.
[O3] Lightweight and Practical Design: Provide real-time protection with minimal overhead, without requiring semantic analysis or prior knowledge of sensitive content, and integrate seamlessly with state-of-the-art LLM serving systems.
2.5 Summary
Our motivational analysis reveals that Automatic Prefix Caching (APC), while critical for high-throughput LLM inference, introduces timing side channels whose exploitability depends on concrete parameters such as the prefix length, LLM size, system load and, effectively, the underlying hardware platform. Longer prefixes that are reused across users and larger models amplify latency gaps between cache hits and misses, enabling adversaries to infer sensitive prompts under low-load conditions.
Existing defenses fail to reconcile this tension between security and efficiency: isolation-based approaches eliminate cache sharing entirely, sacrificing the performance benefits that make APC essential; obfuscation techniques inject noise without addressing the root cause, imposing uniform delays and remaining vulnerable to amplification attacks; and selective sharing mechanisms rely on heavyweight LLM-based semantic analysis, limiting scalability and leaving misclassified requests exposed.
System overview. To achieve these objectives, CacheSolidarity introduces a lightweight pipeline that operates on the request-serving path, as illustrated in Figure 4. CacheSolidarity extends the KV cache with minimal metadata per entry to enable fine-grained tracking of user-prompt interactions and flag suspicious prefixes that are shared and reused across users. When a request arrives, the Activator first decides whether selective isolation should be enabled, based on the distinguishability of the timing side channel under the current LLM model, prefix length, system load, and underlying hardware platform. At the same time, the LLM serving system checks whether the prefixes that are part of the request hit or miss the cache. On a cache miss, a new cache entry is created and tagged with the user's ID; no further action is needed because misses do not leak timing information.
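The request path described above can be sketched in a few lines. This is a minimal illustration under our own naming (CacheEntry, on_request, theta_pct), not the authors' vLLM implementation:

```python
from dataclasses import dataclass

@dataclass
class CacheEntry:
    owner_id: int              # user who first created this entry (set on a miss)
    attack_flag: bool = False  # set when cross-user reuse marks it as a boundary

def on_request(prefix, user_id, cache, kde_overlap_pct, theta_pct=60.0):
    # Activator: enable selective isolation only when the hit/miss TTFT
    # distributions are distinguishable (i.e., their KDE overlap is low).
    isolation_on = kde_overlap_pct < theta_pct
    entry = cache.get(prefix)
    if entry is None:
        # Miss: create and tag a new entry; misses leak no timing signal.
        cache[prefix] = CacheEntry(owner_id=user_id)
        return 'miss'
    if isolation_on and entry.attack_flag and entry.owner_id != user_id:
        return 'isolated'  # Detector stops reuse; the remainder is recomputed
    return 'hit'
```

A miss tags the new entry with the requester's ID; while isolation is active, a hit on a flagged entry is only served to its owner.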
On a cache hit, the Detector determines whether the prefix should be isolated to prevent attacks or shared as usual, following the procedure described in Section 3.2.",
fix. By flagging a suspicious",
If the prefix is owned by a different user, the Detector flags this prefix as isolated for future requests, ensuring that reuse beyond this point is blocked for non-owners.
• Hit on a flagged prefix: If the prefix was previously flagged, the Detector checks whether the next prefix in the prompt belongs to the requesting user. If yes, reuse continues normally. If not, reuse stops at the flagged prefix, and the remaining ones are recomputed through the LLM.
3.2.3 Example Workflow. Figure 5 illustrates how CacheSolidarity handles reuse across three users that issue requests at consecutive timestamps 𝑡1–𝑡4.
The Activator balances security with performance (Section 3.3). Together, these components enable CacheSolidarity to maximize secure cache reuse while mitigating timing attacks without sacrificing efficiency.
3.2 KV Cache Extension and Detector
3.2.1 KV Cache Extension. CacheSolidarity extends each cache entry with the following metadata fields:
• OwnerID: the identifier of the user who first created and populated this cache entry.",
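The validation walk over a request's chain of prefix blocks can be sketched as follows. Flagging the deepest cross-user block is our reading of the example workflow, and all identifiers are illustrative:

```python
class Entry:
    """Minimal stand-in for a cache entry with the two metadata fields."""
    def __init__(self, owner_id):
        self.owner_id = owner_id
        self.attack_flag = False

def reusable_blocks(block_keys, user_id, cache):
    """Return how many leading blocks of the prompt may be reused.
    Reuse stops at a miss, or at a flagged block not owned by the requester;
    everything past that point is recomputed (a partial hit)."""
    reused = 0
    for key in block_keys:
        entry = cache.get(key)
        if entry is None:
            break  # cache miss: reuse ends here
        if entry.attack_flag and entry.owner_id != user_id:
            break  # flagged boundary: blocked for non-owners
        reused += 1
    # Flag the deepest reused block if another user created it: it becomes
    # the isolation boundary for future non-owner requests.
    if reused:
        last = cache[block_keys[reused - 1]]
        if last.owner_id != user_id:
            last.attack_flag = True
    return reused
```

Owners always walk their own chain in full; non-owners get a partial hit that stops at the flagged boundary.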
When set, prefix reuse beyond this point must be disabled for non-owners. These fields add only a few bytes per entry and do not affect tensor layout or memory mapping. In the figure, circles represent cache entries, and arrows show the sequence of prefixes in a prompt. Red circles indicate cache misses, green circles indicate cache hits, and orange circles mark flagged entries where selective isolation applies.",
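The two fields can be carried as plain per-block metadata next to the KV tensors. A sketch under our own naming, showing the set-once owner and the one-way flag:

```python
class BlockMeta:
    """Per-block metadata alongside the KV tensors: a few bytes, no change to
    tensor layout. owner_id is fixed at allocation; attack_flag is set-only."""
    __slots__ = ('_owner_id', 'attack_flag')

    def __init__(self, owner_id: int):
        self._owner_id = owner_id  # written exactly once, at allocation
        self.attack_flag = False

    @property
    def owner_id(self) -> int:
        return self._owner_id  # no setter: immutable after allocation

    def flag(self) -> None:
        self.attack_flag = True  # never cleared while the block stays cached
```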
Flagging an entry is essential because cross-user reuse is the point where timing side channels become exploitable. 𝑡3: User 1 sends a follow-up prompt. Although the flagged entry is reused, this is acceptable because User 1 is the original owner.
(Figure 5 depicts prefix trees of numbered cache entries over prompts of the form \"Write a conversational cover letter for Christina/George for a job application as secretary/programmer at NullPointer Labs/GitPushForce Ltd.\")",
tell them apart).\n𝑡4: User 3 attempts to reuse User 1's prompt but with a different When the overlap exceeds a threshold 𝜃, CacheSolidarity deactiname in the private information. CacheSolidarity detects that the vates selective isolation and allows full prefix reuse. However, the\nAttackFlag is set and that the OwnerID differs, so continuation cache is still updated with the extra metadata. Conversely, when\nis not allowed.", + "paper_id": "2603.10726", + "title": "CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems", + "authors": [ + "Panagiotis Georgios Pennas", + "Konstantinos Papaioannou", + "Marco Guarnieri", + "Thaleia Dimitra Doudali" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10726v1", + "chunk_index": 20, + "total_chunks": 59, + "char_count": 838, + "word_count": 127, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "43816802-256f-470f-908f-1ea21cb63746", + "text": "The common prefix is isolated at the flagged the overlap falls below 𝜃, timing differences are exploitable, and\nentry to prevent leaks, and new cache entries are created for CacheSolidarity activates its detection and prefix isolation mechUser 3. nism. This decision is evaluated at every request using the most\nrecent sliding window of TTFT samples, ensuring that activation\nIn summary, this example illustrates CacheSolidarity's selective adapts dynamically to workload conditions. The selection of the\nisolation: flagged cache entries restrict cross-user continuation, threshold 𝜃is left to the system administrator and it should reflect\ndiverging into new secure paths for non-owners, while allowing (1) the point where hits and misses are statistically indistinguishfull reuse along the original path for the owner. 
able, which would make timing attacks impractical, and (2) the trade-off between stronger security and performance overhead.
We stress that CacheSolidarity always allows owners to extend their prefixes even though some nodes might be flagged, to ensure that benign requests are not delayed. For instance, User 2 would be allowed to further extend their own path 1-2-6-7-8 although node 2 is flagged.
3.3 Activator Optimization
As shown in Section 2.2, timing side channels due to APC are conditionally distinguishable and exploitable: latency gaps between cache hits and misses grow with (i) longer shared prefixes and (ii) larger model sizes, but are (iii) progressively masked as the requests-per-second (RPS) rate increases due to batching, queuing, and GPU saturation effects, (iv) while the hardware platform directly influences timing differences. Therefore, a defense that is always-on would over-protect under high load and impose unnecessary overhead on benign users. CacheSolidarity addresses this by allowing system administrators to conditionally activate selective isolation only when timing signals are distinguishable for the current operating point (model, hardware, workload).
4 SECURITY ANALYSIS
In this section, we analyze the security guarantees provided by CacheSolidarity. We start our analysis by focusing on the case where CacheSolidarity is activated (3.2), and later on we analyze the impact of the activator (3.3) on the security of the system.
Security guarantees when CacheSolidarity is activated. CacheSolidarity's selective isolation scheme has been designed to provide security against prompt-stealing attacks performed through APC timing side channels. In this setting, an attacker attempts to reconstruct a prompt that contains some secret information. The prompt reconstruction works by discovering one entry at a time and can be split into multiple entry-reconstruction problems, each one focusing on discovering the next secret entry secret given a known prefix pre.
We analyze CacheSolidarity's security guarantees in terms of one of these sub-problems. Let pre · secret be a prefix where pre is a known part extended with a secret entry secret. To reconstruct the secret entry, an attacker (1) issues a probing sequence pre · 𝑣1, . . . , pre · 𝑣𝑛 consisting of many different requests pre · 𝑣𝑖 that share the common prefix pre and differ in the last entry, while (2) measuring the TTFT of the probes to determine whether they result in hits or misses in the prefix cache. To decide when such timing signals are distinguishable, CacheSolidarity continuously monitors the latency gap between hits and misses over a sliding time window and computes the Kernel Density Estimation (KDE) overlap between their distributions.",
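The KDE-overlap monitor can be approximated in a few lines of NumPy. The bandwidth and grid size below are our assumptions, not values from the paper:

```python
import numpy as np

def kde_overlap_pct(hit_ttfts, miss_ttfts, bandwidth=0.5, grid_size=512):
    """Percentage overlap between the Gaussian-KDE densities of hit and miss
    TTFT samples: ~100% means indistinguishable distributions, near 0% means
    an exploitable timing gap."""
    hits = np.asarray(hit_ttfts, dtype=float)
    miss = np.asarray(miss_ttfts, dtype=float)
    lo = min(hits.min(), miss.min()) - 3 * bandwidth
    hi = max(hits.max(), miss.max()) + 3 * bandwidth
    grid = np.linspace(lo, hi, grid_size)
    dx = grid[1] - grid[0]

    def density(samples):
        # Sum of Gaussian kernels centred on each sample, normalised to 1.
        k = np.exp(-0.5 * ((grid[:, None] - samples[None, :]) / bandwidth) ** 2)
        d = k.sum(axis=1)
        return d / (d.sum() * dx)

    p, q = density(hits), density(miss)
    return 100.0 * float(np.minimum(p, q).sum() * dx)
```

Fed with the most recent sliding window of TTFT samples, the result would be compared against 𝜃 on every request.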
fixes that are shared between different users and selectively isolates them, which stops an attacker's probing sequences from leaking information about a victim's prefix. The activator (3.3) turns off CacheSolidarity's selective-isolation scheme whenever exploiting the timing side-channel is impractical (captured by the",
For example, issuing bursts of requests to inflate the KDE overlap and force deactivation, or throttling requests to reduce overlap and trigger isolation for probing. CacheSolidarity mitigates this risk through two design choices. First, the Activator computes the KDE overlap over a sliding window of recent requests, requiring sustained conditions rather than reacting to instantaneous fluctuations. Second, activation decisions rely on aggregate system-level measurements rather than per-user metrics, making it difficult for a single attacker to dominate the observed distributions. Together, these mechanisms ensure that activation cannot be easily manipulated through short-term traffic shaping. Furthermore, we remark that even when selective isolation is disabled, CacheSolidarity still tracks and updates the metadata in the cache. That is, prefixes that are shared between different users are still flagged in the prefix cache even when the isolation is turned off; an attacker therefore cannot exploit the threshold to manipulate the cache metadata.
Under these conditions, CacheSolidarity ensures that executing all probes in the sequence 𝑠 from user identifiers different from 𝑢 (even interleaved with additional disjoint requests) will always result in misses for entries 𝑣1, . . . , 𝑣𝑛.
We note that CacheSolidarity's security guarantees are not unconditional; that is, CacheSolidarity protects against probing sequences only when conditions (1) and (2) above are met. We remark that this is a conscious choice made as part of CacheSolidarity's trade-off between security and performance. Next, we describe the two cases not covered by CacheSolidarity:
• Attacks targeting the first entry: CacheSolidarity does not protect against prompt-stealing attacks targeting the first entry of a prompt, i.e., against cases where there is no (non-empty) prefix (that is, condition (1) above).
This follows from the fact that in this case there is no parent node that CacheSolidarity can flag as potentially involved in an attack.",
only for prompts longer than 1024 tokens [49].
• Attacks correct at the first attempt: CacheSolidarity does not protect an unflagged prefix pre · secret against an attacker that guesses the secret value correctly on the first attempt (i.e., condition (2) above). This is due, again, to the lack of a flagged prefix pre that is needed for CacheSolidarity to start isolating a request. From a security perspective, this can be problematic if the space of secrets is very small, since an attacker might easily guess correctly on the first attempt; the risk decreases for larger secret spaces. Note that preventing these attacks by always treating a prefix as flagged would make CacheSolidarity fall back to naive user-level isolation, which introduces too high an overhead.
Beyond the two cases mentioned above, CacheSolidarity successfully secures the system against all remaining probing sequences. We remark that CacheSolidarity's guarantees do not depend on identifying secret information at a semantic level (like in [16]), but rather on the overlap between prefixes from different users and security domains.
Specifically, we address the following research questions:
RQ1: Performance Gains: How much does CacheSolidarity accelerate inference compared to existing defenses across different workloads and models (Sections 5.2.1 and 5.2.2)?
RQ2: Overheads: What are the latency and resource overheads introduced by CacheSolidarity (Sections 5.2.1, 5.2.2, and 5.2.4)?
RQ3: Security: Can CacheSolidarity prevent timing leaks and prompt-stealing attacks? (5.2.3)
RQ4: Sensitivity: What is the impact of the activation threshold on CacheSolidarity's performance? (5.2.4)
5.1 Experimental Setup
We evaluate the performance of CacheSolidarity on a native hardware server equipped with an NVIDIA A100 GPU (40 GB memory). The serving system is vLLM [64] (version 0.8.5) with a block size of 16. Table 2 summarizes the LLM models used for evaluation,
detailing their full name, their parameter count, memory footprint, and the remaining GPU memory available for hosting the KV cache.
Furthermore, CacheSolidarity's guarantees hold regardless of (1) whether a probing sequence is executed by a single attacker or multiple colluding ones, (2) whether benign requests from other users are interleaved in the probing sequence (since these requests would at most result in evicting the victim prefix from the prefix cache), and (3) the initial state of the prefix cache.
Evaluation Baselines. We compare CacheSolidarity against the following baselines, based on the categorization of defenses in",
prefix overlap to capture a wide range of behaviors. Higher overlap means a longer shared prefix, which leads to more cache reuse and better performance. For example, private data at the end of the prompt (\"You are a helpful assistant. I want you to write an email to reply to [sensitive information]\") creates high overlap, while private data at the beginning (\"My name is [sensitive information]\") results in minimal overlap.
• Prefix Caching: Enables reuse of common prefixes cached across all users.",
fixes). We implemented this by modifying vLLM",
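The per-user isolation baseline reduces to a one-line change to how cache prefixes are formed; a minimal illustration over string prompts (the key format is ours):

```python
def cache_key(prompt: str, user_id: int, per_user_isolation: bool) -> str:
    """Per-user isolation baseline: a unique per-user token prepended to every
    request makes identical prompts from different users map to disjoint
    prefixes, removing cross-user hits (and the cross-user reuse benefit)."""
    return f'<u{user_id}>{prompt}' if per_user_isolation else prompt
```

With isolation off, two users' identical prompts share one prefix; with isolation on, their prefixes never match, so the timing channel disappears along with the sharing benefit.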
responsiveness of the system in generating the first token. Note",
    "paper_id": "2603.10726",
    "title": "CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems",
    "authors": [
      "Panagiotis Georgios Pennas",
      "Konstantinos Papaioannou",
      "Marco Guarnieri",
      "Thaleia Dimitra Doudali"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10726v1",
    "chunk_index": 27,
    "total_chunks": 59,
    "char_count": 1207,
    "word_count": 186,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "3e35c6c1-54cf-4fb7-8203-9949e2be83d3",
    "text": "also that, for all experiments in 5.2, the detector component of CacheSolidarity is always active.\nIn contrast, cache hit rate indicates the degree to which the shared prefix cache was utilized, providing insight into the effectiveness of the evaluated baselines. For our experiments, we construct 5 multi-user workloads starting from the ShareGPT [9] dataset, which contains real-world LLM requests.\nImplementation Details. CacheSolidarity is implemented on top of vLLM [64] (version 0.8.5) with the V1 engine for multi-tenancy. To extend the KV cache as described in 3.2, we extend the KVCacheBlock class with the two metadata fields. We introduce a new",
    "paper_id": "2603.10726",
    "title": "CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems",
    "authors": [
      "Panagiotis Georgios Pennas",
      "Konstantinos Papaioannou",
      "Marco Guarnieri",
      "Thaleia Dimitra Doudali"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10726v1",
    "chunk_index": 28,
    "total_chunks": 59,
    "char_count": 659,
    "word_count": 102,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "8f1c6bfc-3823-4834-9f6b-cdad842070dd",
    "text": "Detector component, as part of the Scheduler, that detects malicious requests before scheduling and updates the new metadata fields attacker_flag and owner_id. We extend the metrics collection of vLLM to support the Activator component, which is implemented in the asynchronous LLM execution layer. CacheSolidarity is extensively documented and will be open-sourced to encourage community adoption and future extensions.\nGiven that we want to understand the effects of prefix caching, the workloads we constructed cover varying levels of prefix overlap between requests. Each workload consists of 10 users, each one issuing 100 requests, i.e., 1000 requests overall. The incoming times of the requests follow a Poisson distribution, which is representative of LLM workloads [3, 39, 53, 76]. To create realistic workloads containing private information from different users, we processed the ShareGPT requests as follows:\n• For each request in ShareGPT, we asked the BERT [21] model to identify sensitive words in various positions in the request and to mask them, i.e., to replace them with [Mask].\n• Next, for each masked request, we asked BERT to fill the masked tokens with different candidates, thereby simulating multiple instances of a public request that differ only in user-specific sensitive information.\nThen, we constructed the workloads by selecting requests from this\n5.2 Inference Performance\nHere, we analyze how CacheSolidarity behaves in terms of inference performance for different kinds of workloads (5.2.1) and models (5.2.2). Next, we empirically validate CacheSolidarity's behavior on the attacker's perceived performance on a concrete workload (5.2.3). We conclude by analysing how the different components of CacheSolidarity contribute to the overall overhead (5.2.4).\n5.2.1 Across Workloads.",
    "paper_id": "2603.10726",
    "title": "CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems",
    "authors": [
      "Panagiotis Georgios Pennas",
      "Konstantinos Papaioannou",
      "Marco Guarnieri",
      "Thaleia Dimitra Doudali"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10726v1",
    "chunk_index": 29,
    "total_chunks": 59,
    "char_count": 1805,
    "word_count": 265,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "676855c2-3486-4267-adb1-9b3aa9ba687b",
    "text": "augmented dataset. We mix various prompts with varying levels of\nFigure 7: Comparison of baselines across various workloads. (a) Average TTFT. (b) Cache hit rate.\nFigure 7 reports average time-to-first-token (TTFT) and cache hit rate across the five representative workloads (Figure 6) and their aggregate average. 
The reported values exclude the attacker.",
    "paper_id": "2603.10726",
    "title": "CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems",
    "authors": [
      "Panagiotis Georgios Pennas",
      "Konstantinos Papaioannou",
      "Marco Guarnieri",
      "Thaleia Dimitra Doudali"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10726v1",
    "chunk_index": 30,
    "total_chunks": 59,
    "char_count": 843,
    "word_count": 125,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "cd152939-b2c4-4f45-b25a-b3f71740bcd7",
    "text": "Figure 8: Comparison of baselines across various LLMs. (a) Average TTFT. (b) Cache hit rate.\nWorkloads with high intra-user reuse. Starting with one of the extreme cases, Workload 1 exhibits high intra-user reuse and zero inter-user reuse, meaning that each user repeats similar prompts but does not share common prefixes with other users. As expected, for this workload all baselines behave similarly. Prefix Caching, CacheSolidarity, and User Cache Isolation achieve comparable hit rates and TTFT because reuse occurs only within user boundaries. CacheSolidarity introduces negligible overheads in this scenario,",
    "paper_id": "2603.10726",
    "title": "CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems",
    "authors": [
      "Panagiotis Georgios Pennas",
      "Konstantinos Papaioannou",
      "Marco Guarnieri",
      "Thaleia Dimitra Doudali"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10726v1",
    "chunk_index": 31,
    "total_chunks": 59,
    "char_count": 779,
    "word_count": 101,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "bd8ea832-0351-4f00-8bcb-407002a983e6",
    "text": "matching the efficiency of isolation while maintaining security. Similarly, Workload 2 maintains high intra-user reuse but introduces moderate inter-user reuse. As expected, User Cache Isolation suffers a lower hit rate and higher TTFT because it misses shared prefixes, while CacheSolidarity stays very close to Prefix Caching, showing that selective isolation preserves most performance benefits. This demonstrates that CacheSolidarity preserves most of the performance benefits of prefix caching while enforcing strong security guarantees, even under diverse workload patterns.\nWorkloads with high inter-user reuse. On the other extreme, Workload 5 has high inter-user reuse and zero intra-user reuse, meaning that users ask similar prompts to each other but not to themselves.\n5.2.2 Across LLM Models. Figure 8 shows the average TTFT and cache hit rate of CacheSolidarity compared to the baselines across the LLM models summarized in Table 2. These models span different families and parameter sizes. The model size directly impacts the available KV cache space on our 40 GB NVIDIA A100 GPU: larger models leave less memory for the cache (Table 2). It also affects",
    "paper_id": "2603.10726",
    "title": "CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems",
    "authors": [
      "Panagiotis Georgios Pennas",
      "Konstantinos Papaioannou",
      "Marco Guarnieri",
      "Thaleia Dimitra Doudali"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10726v1",
    "chunk_index": 32,
    "total_chunks": 59,
    "char_count": 1166,
    "word_count": 174,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "cb215a51-2262-4944-8861-c94135f31323",
    "text": "inference speed: the smaller the model, the faster the inference and the lower the TTFT [52].\nThis explains why User Cache Isolation has a zero cache hit rate, leading to cache misses and the highest TTFT among all baselines. Most importantly, this workload stresses the critical path of CacheSolidarity by activating the detector on every request due to the high inter-user reuse. However, we see that CacheSolidarity's lightweight design does not hurt performance: TTFT remains very close to Prefix Caching, and hit rate is similarly high, demonstrating that CacheSolidarity can enforce security without sacrificing efficiency even under worst-case conditions.\nAcross all workloads (Avg group of bars), CacheSolidarity performs between the two baselines: it significantly increases hit rate and lowers TTFT compared to User Cache Isolation, while staying within roughly 5–10% of Prefix Caching, which is the best-performing but insecure approach.\nAs discussed in Section 2 and shown in Figure 8a, the performance gap between Prefix Caching and User Cache Isolation widens as model size increases and also varies by model family. Across all models, CacheSolidarity effectively closes this gap, remaining within a 6% margin of Prefix Caching, while significantly outperforming User Cache Isolation. This demonstrates that CacheSolidarity's selective isolation delivers secure and fast inference. As shown in Figure 8b, cache hit rates are similar across models except for LLaMA-13B. For this large model, the KV cache is very small (Table 2), almost half the size compared to the rest of the models, causing frequent evictions and more cache misses. This limitation also explains the larger TTFT differences across baselines, where even between CacheSolidarity and Prefix Caching there is a margin of 2 ms. In conclusion, the performance of CacheSolidarity remains close to Prefix Caching across all LLM models, even under severe memory pressure with large models. This confirms that CacheSolidarity enables secure LLM serving without unnecessarily reduced performance or unbearable overheads.\nTable 3: Time and memory overheads introduced by the system components of CacheSolidarity. Detector: 0.004 ms per request, 32 bytes per cache entry. Activator: 0.003 ms, 16 bytes * len(window).\nand memory resources. In terms of per-request latency, CacheSolidarity's modifications over vanilla vLLM (i.e., the selective isolation scheme and the activator component) introduce an average per-request overhead of only 0.007 ms (0.004 ms due to the detector component and 0.003 ms due to the activator component). In terms of memory, the metadata introduced by the detector component amount to 32 bytes per cache entry, since each KVCache block was extended with two metadata fields, owner_id and under_attack, of type int.1 In contrast, the activator component stores 1 floating point value per request in the sliding window.2\nFigure 9: Comparison of prefix caching (unprotected system) and CacheSolidarity across requests of a single attacker.\n5.3 Sensitivity Analysis\nThe selection of the KDE overlap threshold represents a tradeoff between security and performance. With very low thresholds, 
CacheSolidarity's selective isolation is active only when the distributions of hits and misses are clearly distinguishable.",
    "paper_id": "2603.10726",
    "title": "CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems",
    "authors": [
      "Panagiotis Georgios Pennas",
      "Konstantinos Papaioannou",
      "Marco Guarnieri",
      "Thaleia Dimitra Doudali"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10726v1",
    "chunk_index": 33,
    "total_chunks": 59,
    "char_count": 3379,
    "word_count": 513,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "0772916e-61c9-42af-bca4-51f8b8aa78ba",
    "text": "Conversely, with very high thresholds, selective isolation is active even when the two distributions are not distinguishable and the timing side-channel is very difficult to exploit.\nTo understand the impact of the threshold on the system's performance, we run CacheSolidarity with varying KDE thresholds (from 0% to 100%) for the LLama2-13B model and Workload 4, which exhibits high reuse, making it suitable for studying the effect of the threshold. Figure 10 reports the results of our experiments, together with the values associated with the different baselines: Prefix Caching in blue, User Cache Isolation in red, and CacheSolidarity when the detector is always active (i.e., the configuration from 5.2) in green. The results align with expectations. For low threshold values (close to 0%), the system performs comparably to Prefix Caching for the same workload since selective isolation is almost always disabled, i.e., CacheSolidarity behaves as the default insecure system. For high thresholds (close to 100%), the system achieves the hit rate and performance of User Cache Isolation.\n5.2.3 Attacker-perceived Performance. Here, we show how CacheSolidarity behaves from an attacker's perspective. For this, we created a dedicated \"attacker workload\" consisting of (1) a benign user issuing a request consisting of an otherwise public prompt containing a single piece of secret information in the middle of the prompt, and (2) 20 requests made by an attacking user that tries to recover the secret, i.e., these 20 requests all follow the public prompt while using 20 different values for the candidate secret. Note that the 9th attacker request is the one guessing the secret correctly.\nWe run this \"attacker workload\" on the unprotected Prefix Caching baseline and on CacheSolidarity. Figure 9 reports the results of this experiment in terms of hit rate and TTFT for the 20 attacker requests (after executing the victim's request first).\nFor the unprotected Prefix Caching baseline, we observe that both hit rate and TTFT are stable across all requests (reflecting that all requests share the same prompt template), except for the 9th request where we see a clear spike in both the hit rate and",
    "paper_id": "2603.10726",
    "title": "CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems",
    "authors": [
      "Panagiotis Georgios Pennas",
      "Konstantinos Papaioannou",
      "Marco Guarnieri",
      "Thaleia Dimitra Doudali"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10726v1",
    "chunk_index": 34,
    "total_chunks": 59,
    "char_count": 2189,
    "word_count": 331,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "50e9ca2e-5f8e-4db6-994a-d4075cc5c2a6",
    "text": "TTFT. In particular, hit rate reaches 100%, which means that the attacker's request exactly matches a prompt previously cached, and that correlates with a decrease in TTFT. These spikes in hit rate and TTFT are expected since the 9th request is the one correctly guessing the secret, a fact that the attacker can clearly observe through the timing side-channel.\nIn contrast, for the system secured by CacheSolidarity, both hit rate and TTFT are uniform across all attacker requests. In particular, there is no spike corresponding to the 9th request, i.e., the one where the secret is guessed correctly.\nIn general, increasing the KDE threshold increases the security guarantees, because more requests are selectively isolated, and introduces more overhead, so the activator component provides the system administrator with a way of effectively balancing performance and security. We also note that, consistently with the results from 5.2, for all KDE values, CacheSolidarity outperforms the User Cache Isolation baseline.\n6 RELATED WORK",
    "paper_id": "2603.10726",
    "title": "CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems",
    "authors": [
      "Panagiotis Georgios Pennas",
      "Konstantinos Papaioannou",
      "Marco Guarnieri",
      "Thaleia Dimitra Doudali"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10726v1",
    "chunk_index": 35,
    "total_chunks": 59,
    "char_count": 1019,
    "word_count": 148,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "a7387f6d-96ed-46a3-b510-d0c4d6fcceba",
    "text": "That is, an attacker observing TTFT cannot determine which of the 20 requests is the one matching the victim's prompt.\nOptimizations of LLM Serving Systems. Recent system-level optimizations use approximate techniques, such as modifying the 
attention mechanism [20, 35, 42, 68, 77], reusing [25, 26, 69], or compressing the KV cache [23, 40, 43, 56, 60, 61]. However, these techniques sacrifice accuracy for faster inference, which may not be acceptable for all use cases. On the other hand, exact techniques aim to reduce the latency by efficiently managing the KV cache [39, 55, 72], sharing prefixes [36, 37, 41, 71], or optimizing scheduling decisions [1–3, 15, 33, 73]. Other approaches attempt to mitigate the impact of long-context processing by disaggregating the prefill and decode phases of inference [53, 76] or rescheduling requests to reduce memory fragmentation [59].\nThis empirically confirms that CacheSolidarity successfully closes the timing leak on our \"attacker workload\", thereby securing the system.\n5.2.4 Overheads. Table 3 summarizes the operational overheads introduced by the two components of CacheSolidarity in terms of latency\n1 Since vLLM is implemented in Python, adding two int fields (as Python objects) to the KVCache class results in an increase of 32 bytes.\n2 Again, a floating point value in Python takes up to 16 bytes.\nFigure 10: Comparison of hit rate and TTFT as a function of the KDE overlapping threshold.\nWu et al. [67] do not rely on timing differences; instead, they reconstruct prompts token by token by exploiting scheduling policies (e.g., Longest Prefix Match) and observing changes in serving order. Collectively, these works establish the feasibility and severity of prompt-stealing attacks via timing side-channels, underscoring the need for robust defenses in LLM serving systems.\nFurther defenses against prompt-stealing attacks. We already provided a review of existing defenses against prompt-stealing attacks that focus on mitigating timing leaks through APC in 2.4. Here, we analyze further defenses proposed in the literature. To mitigate prompt-stealing attacks through APC in vLLM, Song et al. [58] proposed to increase the number of tokens per cache entry. Although this does not prevent the leak, it makes it more difficult for attackers to guess the secret correctly, since now they have to guess all tokens in an entry at once.\nSide-channels and performance optimizations. Performance optimizations work by improving resource consumption (e.g., timing, memory, energy) in the average case. Thus, they introduce variations in the resource usage profile that have been shown, time and again, to lead to side-channels that malicious attackers can exploit to leak sensitive information.\n7 SUMMARY\nThis paper builds CacheSolidarity, a system that secures LLM serving against timing side-channel attacks introduced by Automatic Prefix Caching (APC) without sacrificing performance. CacheSolidarity monitors cache access patterns across users, flags suspicious reuse of cached prefixes, and isolates later paths of flagged prefixes using minimal extensions to KV cache entries. Evaluation shows that CacheSolidarity achieves real-time protection against prompt-stealing attacks and achieves high performance and scalability, effectively closing the gap between insecure APC and current",
    "paper_id": "2603.10726",
    "title": "CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems",
    "authors": [
      "Panagiotis Georgios Pennas",
      "Konstantinos Papaioannou",
      "Marco Guarnieri",
      "Thaleia Dimitra Doudali"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10726v1",
    "chunk_index": 36,
    "total_chunks": 59,
    "char_count": 3568,
    "word_count": 537,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "7fe53338-3dee-48fe-8f38-df1df41b56e8",
    "text": "heavy-handed mitigations.\nAt the hardware level, historically the first side-channel attacks exploited leaks that resulted from physical side-effects, such as power consumption or electromagnetic emanations, which require physical access to the device under attack. More recently, however, side channels have been shown to exist due to microarchitectural components that are shared between processes, such as the CPU cache [70], the DRAM [54], branch prediction units [22], the TLB [28], execution ports [4], and the ring interconnect [50]. Side channels are not limited to hardware, though.\nREFERENCES\n[1] Amey Agrawal, Junda Chen, Íñigo Goiri, Ramachandran Ramjee, Chaojie Zhang, Alexey Tumanov, and Esha Choukse. 2024. Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations. arXiv:2409.17264 [cs.LG] https://arxiv.org/abs/2409.17264\n[2] Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. 
Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2024.",
    "paper_id": "2603.10726",
    "title": "CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems",
    "authors": [
      "Panagiotis Georgios Pennas",
      "Konstantinos Papaioannou",
      "Marco Guarnieri",
      "Thaleia Dimitra Doudali"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10726v1",
    "chunk_index": 37,
    "total_chunks": 59,
    "char_count": 1058,
    "word_count": 142,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "c92ca78a-f3d1-4aa3-9c33-43f98665b81a",
    "text": "Taming throughput-latency tradeoff in LLM inference with sarathi-serve. In Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation (Santa Clara, CA, USA) (OSDI'24). USENIX Association, USA, Article 7, 18 pages.\nThere have been many side-channels arising from optimizations in software like browsers [57, 63], operating systems [32], or even communication protocols [46, 66].\n[3] Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, and Ramachandran Ramjee. 2023.",
    "paper_id": "2603.10726",
    "title": "CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems",
    "authors": [
      "Panagiotis Georgios Pennas",
      "Konstantinos Papaioannou",
      "Marco Guarnieri",
      "Thaleia Dimitra Doudali"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10726v1",
    "chunk_index": 38,
    "total_chunks": 59,
    "char_count": 522,
    "word_count": 71,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "a845ebb2-0dda-465b-bd17-419920afbb2b",
    "text": "SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills. arXiv:2308.16369 [cs.LG]\nTiming Side-Channel Attacks in LLM Serving Systems. 
Recent https://arxiv.org/abs/2308.16369\nresearch has shown that performance optimizations in LLM serving [4] Alejandro Cabrera Aldaya, Billy Bob Brumley, Sohaib ul Hassan, Cesar Pereida\nsystems, such as KV caching and prefix reuse, introduce exploitable García, and Nicola Tuveri. 2019. Port Contention for Fun and Profit. In Proceedings\nof the 40th IEEE Symposium on Security and Privacy (S&P '19). IEEE.\ntiming side channels.", + "paper_id": "2603.10726", + "title": "CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems", + "authors": [ + "Panagiotis Georgios Pennas", + "Konstantinos Papaioannou", + "Marco Guarnieri", + "Thaleia Dimitra Doudali" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10726v1", + "chunk_index": 39, + "total_chunks": 59, + "char_count": 584, + "word_count": 80, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "d037e34a-47df-4ae7-a842-9f97c926793d", + "text": "Song et al. [58] were among the first to identify [5] Alibaba Cloud. 2024. Qwen2-VL 2B Instruct Model. https://huggingface.co/Qwen/\nthese issues, demonstrating that cache hits and misses produce Qwen2-VL-2B-Instruct Available on Hugging Face Hub.\nmeasurable latency differences that can be leveraged to reconstruct [6] Alibaba Cloud. 2024. Qwen2-VL 7B Instruct Model. 
https://huggingface.co/Qwen/\nQwen2-VL-7B-Instruct Available on Hugging Face Hub.\nuser prompts.", + "paper_id": "2603.10726", + "title": "CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems", + "authors": [ + "Panagiotis Georgios Pennas", + "Konstantinos Papaioannou", + "Marco Guarnieri", + "Thaleia Dimitra Doudali" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10726v1", + "chunk_index": 40, + "total_chunks": 59, + "char_count": 462, + "word_count": 60, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "06921a34-5f3e-41b2-bbf3-c0812068702b", + "text": "Building on this insight, Zheng et al. [75] proposed [7] Alibaba Cloud. 2024. Qwen2.5-VL 3B Instruct Model. https://huggingface.co/\npractical probing strategies that exploit these timing variations Qwen/Qwen2.5-VL-3B-Instruct Available on Hugging Face Hub.\n[8] Alibaba Cloud. 2024. Qwen2.5-VL 7B Instruct Model. https://huggingface.co/\nto steal sensitive input data in multi-tenant environments. More Qwen/Qwen2.5-VL-7B-Instruct Available on Hugging Face Hub.\nrecently, Gu et al. [31] highlighted privacy risks in commercial APIs, [9] anon8231489123. 2023. ShareGPT Vicuna unfiltered. https://huggingface.co/\nrevealing that shared caches can leak user prompts even without datasets/anon8231489123/ShareGPT%20Vicuna%20unfiltered. Dataset on Hugging Face.\ndirect memory access. [10] F. Gptcache: An open-source semantic cache for LLM applications\nBeyond cache-based optimizations, other works have explored enabling faster answers and cost savings. In Proceedings of the 3rd Workshop for\ntiming side-channels introduced by inference optimizations. For Natural Language Processing Open Source Software (NLP-OSS 2023). 212–218. [11] BelleGroup. 2023. Multiturn Chat 0.8M. https://huggingface.co/datasets/\nexample, Carlini and Nasr [14] show that techniques such as spec- BelleGroup/multiturn%20chat%200.8M. 
Dataset on Hugging Face.\nulative decoding create data-dependent timing variations that can [12] Andrew Bortz and Dan Boneh. 2007. Exposing private information by timing web\napplications. In Proceedings of the 16th International Conference on World Wide\nbe exploited by a network adversary to infer conversation topics Web (Banff, Alberta, Canada) (WWW '07). Association for Computing Machinery,\nor sensitive information over encrypted channels. In contrast, Wu New York, NY, USA, 621–628. https://doi.org/10.1145/1242572.1242656", + "paper_id": "2603.10726", + "title": "CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems", + "authors": [ + "Panagiotis Georgios Pennas", + "Konstantinos Papaioannou", + "Marco Guarnieri", + "Thaleia Dimitra Doudali" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10726v1", + "chunk_index": 41, + "total_chunks": 59, + "char_count": 1831, + "word_count": 228, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "564608aa-f0d6-47dd-9c9d-f8691f7f501f", + "text": "Language Models are Statistics-Theory and Methods 18, 10 (1989), 3851–3874. https://doi.org/10.1080/\nFew-Shot Learners. Advances in Neural Information Processing Systems 33 (2020), 03610928908830127\n1877–1901. [35] Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo,\n[14] Nicholas Carlini and Milad Nasr. 2024. Remote Timing Attacks on Efficient Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing\nLanguage Model Inference. arXiv:2410.17175 [cs.CR] https://arxiv.org/abs/2410. Yang, and Lili Qiu. 2024. MInference 1.0: Accelerating Pre-filling for Long-\n17175 Context LLMs via Dynamic Sparse Attention. arXiv:2407.02490 [cs.CL] https:\n[15] Shaoyuan Chen, Yutong Lin, Mingxing Zhang, and Yongwei Wu. 2024. Effi- //arxiv.org/abs/2407.02490\ncient and Economic Large Language Model Inference with Attention Offloading. 
[36] Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Shufan Liu, Xuanzhe Liu, and\narXiv:2405.01814 [cs.LG] https://arxiv.org/abs/2405.01814 Xin Jin. 2025. RAGCache: Efficient Knowledge Caching for Retrieval-Augmented\n[16] Kexin Chu, Zecheng Lin, Dawei Xiang, Zixu Shen, Jianchang Su, Cheng Generation. Syst. 44, 1, Article 2 (Nov. 2025), 27 pages. Chu, Yiwei Yang, Wenhui Zhang, Wenfei Wu, and Wei Zhang. 2025. Selec- https://doi.org/10.1145/3768628\ntive KV-Cache Sharing to Mitigate Timing Side-Channels in LLM Inference. [37] Jordan Juravsky, Bradley Brown, Ryan Ehrlich, Daniel Y. Fu, Christopher Ré,\narXiv:2508.08438 [cs.CR] https://arxiv.org/abs/2508.08438 and Azalia Mirhoseini. 2024. Hydragen: High-Throughput LLM Inference with\n[17] Rasmus Dahlberg and Tobias Pulls. 2023. Timeless Timing Attacks and Preload Shared Prefixes. arXiv:2402.05099 [cs.LG] https://arxiv.org/abs/2402.05099\nDefenses in Tor's DNS Cache. In 32nd USENIX Security Symposium (USENIX [38] David Kohlbrenner and Hovav Shacham. 2016.", + "paper_id": "2603.10726", + "title": "CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems", + "authors": [ + "Panagiotis Georgios Pennas", + "Konstantinos Papaioannou", + "Marco Guarnieri", + "Thaleia Dimitra Doudali" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10726v1", + "chunk_index": 43, + "total_chunks": 59, + "char_count": 1868, + "word_count": 238, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "57ab1a80-5743-4726-9955-1dfb14b44c6c", + "text": "USENIX Association, Anaheim, CA, 2635–2652. https://www.usenix. times. In Proceedings of the 25th USENIX Conference on Security Symposium\norg/conference/usenixsecurity23/presentation/dahlberg (Austin, TX, USA) (SEC'16). USENIX Association, USA, 463–480.\n[18] DeepMind. 2025. Gomini. https://deepmind.google/technologies/gemini/ [39] W. Zhang,\n[19] DeepSeek. 2024. 
DeepSeek API Docs: DeepSeek API Introduces Context Caching on and I. Efficient Memory Management for Large Language Model\nDisk, Cutting Prices by an Order of Magnitude. https://api-docs.deepseek.com/ Serving with PagedAttention. In Proceedings of the 29th Symposium on Operating\nnews/news0802/ Accessed: 2025-07-17. Systems Principles. 611–626.\n[20] Aditya Desai, Shuo Yang, Alejandro Cuadron, Ana Klimovic, Matei Zaharia, [40] Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. 2024. Gonzalez, and Ion Stoica. 2024.", + "paper_id": "2603.10726", + "title": "CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems", + "authors": [ + "Panagiotis Georgios Pennas", + "Konstantinos Papaioannou", + "Marco Guarnieri", + "Thaleia Dimitra Doudali" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10726v1", + "chunk_index": 45, + "total_chunks": 59, + "char_count": 885, + "word_count": 108, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "9c8ba177-ae50-4def-9bb8-58abcb740918", + "text": "HashAttention: Semantic Sparsity for Efficient Generative Inference of Large Language Models with Dynamic KV\nFaster Inference. arXiv:2412.14468 [cs.LG] https://arxiv.org/abs/2412.14468 Cache Management. In 18th USENIX Symposium on Operating Systems Design\n[21] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: and Implementation (OSDI 24). 
USENIX Association, Santa Clara, CA, 155–172.",
    "paper_id": "2603.10726",
    "title": "CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems",
    "authors": [
      "Panagiotis Georgios Pennas",
      "Konstantinos Papaioannou",
      "Marco Guarnieri",
      "Thaleia Dimitra Doudali"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10726v1",
    "chunk_index": 46,
    "total_chunks": 59,
    "char_count": 414,
    "word_count": 51,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "2573d88b-a989-4966-bc04-b86092f5e1d3",
    "text": "Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 4171–4186. https://doi.org/10.48550/arXiv.1810.04805
https://www.usenix.org/conference/osdi24/presentation/lee
[41] Jieyu Lin, Sai Qian Zhang, and Alberto Leon-Garcia. 2024. sLLM: Accelerating LLM Inference using Semantic Load Balancing with Shared Memory Data Structures. In 2024 25th International Symposium on Quality Electronic Design (ISQED). 1–6. https://doi.org/10.1109/ISQED60706.2024.10528703
[22] Dmitry Evtyushkin, Ryan Riley, Nael B. Abu-Ghazaleh, and Dmitry Ponomarev. 2018. BranchScope: A New Side-Channel Attack on Directional Branch Predictor. In Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '18). ACM.
[42] Hao Liu, Matei Zaharia, and Pieter Abbeel. 2023. Ring Attention with Blockwise Transformers for Near-Infinite Context. arXiv:2310.01889 [cs.CL] https://arxiv.org/abs/2310.01889
[43] Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng
[23] Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, and Mahyar Najibi. 2024. 
LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference. arXiv:2407.14057 [cs.CL] https://arxiv.org/abs/2407.
Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, and Junchen Jiang. 2024.",
    "paper_id": "2603.10726",
    "title": "CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems",
    "authors": [
      "Panagiotis Georgios Pennas",
      "Konstantinos Papaioannou",
      "Marco Guarnieri",
      "Thaleia Dimitra Doudali"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10726v1",
    "chunk_index": 47,
    "total_chunks": 59,
    "char_count": 1580,
    "word_count": 192,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "7f91cab7-1972-4a0b-b933-3d3485dcfe02",
    "text": "14057
CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving. In Proceedings of the ACM SIGCOMM 2024 Conference (Sydney, NSW, Australia) (ACM SIGCOMM '24). Association for Computing Machinery, New York, NY, USA, 38–56. https://doi.org/10.1145/3651890.3672274
[24] V. Configurable Safety Tuning of Language Models with Synthetic Preference Data. (2024). Preprint.
[25] Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, and Pengfei Zuo. 2025. Cost-efficient large language model serving for multi-turn conversations with CachedAttention. In Proceedings of the 2024 USENIX Conference on Usenix Annual Technical Conference (Santa Clara, CA, USA) (USENIX ATC'24). USENIX Association, USA, Article 7, 16 pages.
[44] LLaVA Team. 2024. LLaVA-OneVision Qwen2 0.5B (OV-HF). https://huggingface.co/llava-onevision-qwen2-0.5b-ov-hf Available on Hugging Face Hub.
[45] LLaVA Team. 2024. LLaVA-OneVision Qwen2 7B OV-Chat (HF). https://huggingface.co/llava-onevision-qwen2-7b-ov-chat-hf Available on Hugging
Face Hub.
[46] Aastha Mehta, Mohamed Alzayat, Roberta De Viti, Björn B. Brandenburg, Peter
[26] In Gim, Guojun Chen, Seung seob Lee, Nikhil Sarda, Anurag Khandelwal, and Lin Zhong. 2024.",
    "paper_id": "2603.10726",
    "title": "CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems",
    "authors": [
      "Panagiotis Georgios Pennas",
      "Konstantinos Papaioannou",
      "Marco Guarnieri",
      "Thaleia Dimitra Doudali"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10726v1",
    "chunk_index": 48,
    "total_chunks": 59,
    "char_count": 1260,
    "word_count": 166,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "7c4c15d0-dd59-48c4-ad5d-1d897de0714a",
    "text": "Prompt Cache: Modular Attention Reuse for Low-Latency Inference. arXiv:2311.04934 [cs.CL] https://arxiv.org/abs/2311.04934
Druschel, and Deepak Garg. 2022. Pacer: Comprehensive Network Side-Channel Mitigation in the Cloud. In 31st USENIX Security Symposium (USENIX Security
[27] Google DeepMind. 2025.",
    "paper_id": "2603.10726",
    "title": "CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems",
    "authors": [
      "Panagiotis Georgios Pennas",
      "Konstantinos Papaioannou",
      "Marco Guarnieri",
      "Thaleia Dimitra Doudali"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10726v1",
    "chunk_index": 49,
    "total_chunks": 59,
    "char_count": 301,
    "word_count": 35,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "f102ec97-b7a4-432a-96db-1205713d4470",
    "text": "Gemma 3 4B Instruct Model. https://huggingface.co/google/gemma-3-4b-it Available on Hugging Face Hub.
22). USENIX Association, Boston, MA, 2819–2838. https://www.usenix.org/conference/usenixsecurity22/presentation/mehta
[28] Ben Gras, Kaveh Razavi, Herbert Bos, and Cristiano Giuffrida. 2018. Translation
[47] Meta AI. 2023. Llama 2 13B Chat Model. 
https://huggingface.co/meta-llama/Llama-2-13b-chat-hf Available on Hugging Face Hub.
Leak-aside Buffer: Defeating Cache Side-channel Protections with TLB Attacks. In Proceedings of the 27th USENIX Security Symposium (USENIX Security '18). USENIX Association.
[48] Meta AI. 2023. Llama 2 7B Chat Model. https://huggingface.co/meta-llama/Llama-2-7b-chat-hf Available on Hugging Face Hub.
[29] Ben Gras, Kaveh Razavi, Erik Bosman, Herbert Bos, and Cristiano Giuffrida. 2017.
[49] OpenAI. 2024.",
    "paper_id": "2603.10726",
    "title": "CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems",
    "authors": [
      "Panagiotis Georgios Pennas",
      "Konstantinos Papaioannou",
      "Marco Guarnieri",
      "Thaleia Dimitra Doudali"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10726v1",
    "chunk_index": 50,
    "total_chunks": 59,
    "char_count": 841,
    "word_count": 101,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "296e92fb-ad70-4f40-823d-6966e2653f0b",
    "text": "Prompt Caching: Reduce Latency and Cost with Prompt Caching. https://platform.openai.com/docs/guides/prompt-caching Accessed: 2025-07-17.
ASLR on the Line: Practical Cache Attacks on the MMU. Paper=https://download.vusec.net/papers/anc_ndss17.pdf Slides=https://vusec.net/wp-content/uploads/2016/11/TalkGras.pdf Web=https://www.vusec.net/projects/anc Code=https://github.com/vusec/revanc Press=https://goo.gl/KL4Bta
[50] Riccardo Paccagnella, Licheng Luo, and Christopher W. Lord of the Ring(s): Side Channel Attacks on the CPU On-Chip Ring Interconnect Are Practical. In Proceedings of the 30th USENIX Security Symposium (USENIX Security '21).
[30] Daniel Gruss, Erik Kraft, Trishita Tiwari, Michael Schwarz, Ari Trachtenberg, Jason Hennessey, Alex Ionescu, and Anders Fogh. 2019. In
[51] Zixuan Pang, Wenhao Wang, and Yong Liao. 2024. 
Cache Partitioning for Mitigating Timing Side-Channel Attacks in LLM Serving Systems.
Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications",
    "paper_id": "2603.10726",
    "title": "CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems",
    "authors": [
      "Panagiotis Georgios Pennas",
      "Konstantinos Papaioannou",
      "Marco Guarnieri",
      "Thaleia Dimitra Doudali"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10726v1",
    "chunk_index": 51,
    "total_chunks": 59,
    "char_count": 998,
    "word_count": 111,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "f8cce542-a4c7-4f3d-9bda-acad34370a60",
    "text": "In 2024 6th International Conference on Frontier Technologies of Information and Computer (ICFTIC). 1238–1245. https://doi.org/10.1109/ICFTIC64248.2024.10913329
Security (London, United Kingdom) (CCS '19). Association for Computing Machinery, New York, NY, USA, 167–180. https://doi.org/10.1145/3319535.3339809
[52] Konstantinos Papaioannou and Thaleia Dimitra Doudali. 2024. The Importance
[31] Chenchen Gu, Xiang Lisa Li, Rohith Kuditipudi, Percy Liang, and Tatsunori Hashimoto. 2025.",
    "paper_id": "2603.10726",
    "title": "CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems",
    "authors": [
      "Panagiotis Georgios Pennas",
      "Konstantinos Papaioannou",
      "Marco Guarnieri",
      "Thaleia Dimitra Doudali"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10726v1",
    "chunk_index": 52,
    "total_chunks": 59,
    "char_count": 488,
    "word_count": 56,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "13e6e2b9-cc67-4703-8f70-53f41ab72a29",
    "text": "Auditing Prompt Caching in Language Model APIs.
of Workload Choice in Evaluating LLM Inference Systems. 
In Proceedings of the 4th Workshop on Machine Learning and Systems (Athens, Greece) (EuroMLSys '24).
arXiv:2502.07776 [cs.CL] https://arxiv.org/abs/2502.07776
[32] Marcus Hähnel, Weidong Cui, and Marcus Peinado. 2017. High-resolution side",
    "paper_id": "2603.10726",
    "title": "CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems",
    "authors": [
      "Panagiotis Georgios Pennas",
      "Konstantinos Papaioannou",
      "Marco Guarnieri",
      "Thaleia Dimitra Doudali"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10726v1",
    "chunk_index": 53,
    "total_chunks": 59,
    "char_count": 342,
    "word_count": 44,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "c1e8f032-76c9-4669-a2e1-e82f43c01c3a",
    "text": "Association for Computing Machinery, New York, NY, USA, 39–46. https://doi.org/10.1145/3642970.3655823
channels for untrusted operating systems. In Proceedings of the 2017 USENIX Conference on Usenix Annual Technical Conference (Santa Clara, CA, USA) (USENIX ATC '17). USENIX Association, USA, 299–312.
[53] Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed",
    "paper_id": "2603.10726",
    "title": "CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems",
    "authors": [
      "Panagiotis Georgios Pennas",
      "Konstantinos Papaioannou",
      "Marco Guarnieri",
      "Thaleia Dimitra Doudali"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10726v1",
    "chunk_index": 54,
    "total_chunks": 59,
    "char_count": 386,
    "word_count": 51,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "ce90d7ef-dd50-4acb-8ba1-294f3a732244",
    "text": "[33] Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff
Maleki, and Ricardo Bianchini. 2024. Splitwise: Efficient Generative LLM Inference Using Phase Splitting. 
In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA). 118–132. https://doi.org/10.1109/ISCA59077.2024.00019
Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, and Yuxiong He. 2024. DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference. arXiv:2401.08671 [cs.PF] https://arxiv.org/abs/2401.08671
[34] Henry F. The overlapping coefficient as a measure of agreement between probability distributions.
[54] Peter Pessl, Daniel Gruss, Clémentine Maurice, Michael Schwarz, and Stefan Mangard. 2016. DRAMA: Exploiting DRAM Addressing for Cross-CPU Attacks. In Proceedings of the 25th USENIX Security Symposium (USENIX Security '16). USENIX Association.
[55] R. Xu. 2025. 
[75] Xinyao Zheng, Husheng Han, Shangyi Shi, Qiyan Fang, Zidong Du, Xing Hu, and Qi Guo. 2024.",
    "paper_id": "2603.10726",
    "title": "CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems",
    "authors": [
      "Panagiotis Georgios Pennas",
      "Konstantinos Papaioannou",
      "Marco Guarnieri",
      "Thaleia Dimitra Doudali"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10726v1",
    "chunk_index": 55,
    "total_chunks": 59,
    "char_count": 1057,
    "word_count": 137,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "f6036079-5cb5-44a5-92af-4f072aad7404",
    "text": "InputSnatch: Stealing Input in LLM Services via Timing Side-Channel Attacks. arXiv:2411.18191 [cs.CR] https://arxiv.org/abs/2411.18191
Mooncake: Trading More Storage for Less Computation — A KVCache-Centric Architecture for Serving LLM Chatbot. In 23rd USENIX Conference on File and Storage Technologies (FAST 25). 155–170.
[76] Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. 2024. 
DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). USENIX Association, Santa Clara, CA, 193–210. https://www.usenix.org/conference/osdi24/presentation/zhong-yinmin
[56] Zhenmei Shi, Yifei Ming, Xuan-Phi Nguyen, Yingyu Liang, and Shafiq Joty. 2025. Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction. https://openreview.net/forum?id=9iN8p1Xwtg
[57] Peter Snyder, Soroush Karami, Arthur Edelstein, Benjamin Livshits, and Hamed Haddadi. 2023. Pool-party: exploiting browser resource pools for web tracking. In Proceedings of the 32nd USENIX Conference on Security Symposium (Anaheim,
[77] Qianchao Zhu, Jiangfei Duan, Chang Chen, Siran Liu, Xiuhong Li, Guanyu Feng, Xin Lv, Xiao Chuanfu, Dahua Lin, and Chao Yang. 2025.",
    "paper_id": "2603.10726",
    "title": "CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems",
    "authors": [
      "Panagiotis Georgios Pennas",
      "Konstantinos Papaioannou",
      "Marco Guarnieri",
      "Thaleia Dimitra Doudali"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10726v1",
    "chunk_index": 56,
    "total_chunks": 59,
    "char_count": 1333,
    "word_count": 174,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "d3d419a7-8b72-415d-be7d-d0b9b5848464",
    "text": "CA, USA) (SEC '23). USENIX Association, USA, Article 397, 15 pages.
SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention. In Eighth Conference on Machine Learning and Systems. https:
[58] Linke Song, Zixuan Pang, Wenhao Wang, Zihao Wang, XiaoFeng Wang, Hongbo Chen, Wei Song, Yier Jin, Dan Meng, and Rui Hou. 2025. 
The Early Bird Catches the Leak: Unveiling Timing Side Channels in LLM Serving Systems.
//openreview.net/forum?id=RuZ80yl71h
[59] Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, and Wei Lin. 2024. Llumnix: Dynamic Scheduling for Large Language Model Serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). USENIX Association, Santa Clara, CA, 173–191. https://www.usenix.org/conference/osdi24/presentation/sun-biao
[60] Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, and Beidi Chen. 2024. ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference. arXiv:2410.21465 [cs.LG] https://arxiv.org/abs/2410.21465
[61] Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, and Song Han. 2024. QUEST: query-aware sparsity for efficient long-context LLM inference. In Proceedings of the 41st International Conference on Machine Learning (Vienna, Austria) (ICML'24). JMLR.org, Article 1955, 11 pages.
[62] Hugo Touvron et al. 2023. LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971 (2023).
[63] Pepe Vila and Boris Köpf. 2017. Loophole: Timing Attacks on Shared Event Loops in Chrome. In 26th USENIX Security Symposium (USENIX Security 17).",
    "paper_id": "2603.10726",
    "title": "CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems",
    "authors": [
      "Panagiotis Georgios Pennas",
      "Konstantinos Papaioannou",
      "Marco Guarnieri",
      "Thaleia Dimitra Doudali"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10726v1",
    "chunk_index": 57,
    "total_chunks": 59,
    "char_count": 1686,
    "word_count": 232,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "36422def-c0f6-4dff-89f4-d80a992cc8d9",
    "text": "USENIX Association, Vancouver, BC, 849–864. 
https://www.usenix.org/conference/usenixsecurity17/technical-sessions/presentation/vila
[64] vLLM Team. 2024. vLLM: High-Throughput Serving for Large Language Models. https://github.com/vllm-project/vllm.
[65] vLLM Team. 2025. Automatic Prefix Caching in vLLM. https://docs.vllm.ai/en/latest/features/automatic_prefix_caching/.
[66] Charles V. Wright, Lucas Ballard, Scott E. Coull, Fabian Monrose, and Gerald M.",
    "paper_id": "2603.10726",
    "title": "CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems",
    "authors": [
      "Panagiotis Georgios Pennas",
      "Konstantinos Papaioannou",
      "Marco Guarnieri",
      "Thaleia Dimitra Doudali"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10726v1",
    "chunk_index": 58,
    "total_chunks": 59,
    "char_count": 458,
    "word_count": 44,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "2120d87a-e445-4e80-8eaa-20802916741a",
    "text": "Spot Me if You Can: Uncovering Spoken Phrases in Encrypted VoIP Conversations. In 2008 IEEE Symposium on Security and Privacy (sp 2008). 35–49. https://doi.org/10.1109/SP.2008.21
[67] Guanlong Wu, Zheng Zhang, Yao Zhang, Weili Wang, Jianyu Niu, Ye Wu, and Yinqian Zhang. 2025. I Know What You Asked: Prompt Leakage via KV-Cache Sharing in Multi-Tenant LLM Serving. In Proceedings of the 2025 Network and Distributed System Security (NDSS) Symposium. San Diego, CA, USA.
[68] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2024. Efficient Streaming Language Models with Attention Sinks.
[69] Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, and Junchen Jiang. 2025. 
CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion.",
    "paper_id": "2603.10726",
    "title": "CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems",
    "authors": [
      "Panagiotis Georgios Pennas",
      "Konstantinos Papaioannou",
      "Marco Guarnieri",
      "Thaleia Dimitra Doudali"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10726v1",
    "chunk_index": 59,
    "total_chunks": 59,
    "char_count": 817,
    "word_count": 124,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "46b1d2b4-8a6e-4c81-b844-b2f80445717d",
    "text": "In Proceedings of the Twentieth European Conference on Computer Systems (Rotterdam, Netherlands) (EuroSys '25). Association for Computing Machinery, New York, NY, USA, 94–109. https://doi.org/10.1145/3689031.3696098
[70] Yuval Yarom and Katrina Falkner. 2014. FLUSH+RELOAD: A High Resolution, Low Noise, L3 Cache Side-Channel Attack. In Proceedings of the 23rd USENIX Security Symposium (USENIX Security '14). USENIX Association.
[71] Lu Ye, Ze Tao, Yong Huang, and Yang Li. 2024.",
    "paper_id": "2603.10726",
    "title": "CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems",
    "authors": [
      "Panagiotis Georgios Pennas",
      "Konstantinos Papaioannou",
      "Marco Guarnieri",
      "Thaleia Dimitra Doudali"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10726v1",
    "chunk_index": 60,
    "total_chunks": 59,
    "char_count": 480,
    "word_count": 65,
    "chunking_strategy": "semantic"
  },
  {
    "chunk_id": "c5718bd6-4ea8-456d-b0d0-656ed2212eb6",
    "text": "ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). 
Association for Computational Linguistics, Bangkok, Thailand, 11608–11620. https://doi.org/10.18653/v1/2024.acl-long.623
[72] Lingfan Yu, Jinkun Lin, and Jinyang Li. 2025. Stateful Large Language Model Serving with Pensieve. In Proceedings of the Twentieth European Conference on Computer Systems (Rotterdam, Netherlands) (EuroSys '25). Association for Computing Machinery, New York, NY, USA, 144–158. https://doi.org/10.1145/3689031.3696086
[73] Siyan Zhao, Daniel Israel, Guy Van den Broeck, and Aditya Grover. 2024. Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models. arXiv:2404.09529 [cs.LG] https://arxiv.org/abs/2404.09529
[74] L. Gonzalez, et al. 2024. SGLang: Efficient Execution of Structured Language Model Programs. Advances in Neural Information Processing Systems 37 (2024), 62557–62583.",
    "paper_id": "2603.10726",
    "title": "CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems",
    "authors": [
      "Panagiotis Georgios Pennas",
      "Konstantinos Papaioannou",
      "Marco Guarnieri",
      "Thaleia Dimitra Doudali"
    ],
    "published_date": "2026-03-11",
    "primary_category": "",
    "arxiv_url": "http://arxiv.org/abs/2603.10726v1",
    "chunk_index": 61,
    "total_chunks": 59,
    "char_count": 1111,
    "word_count": 138,
    "chunking_strategy": "semantic"
  }
]
\ No newline at end of file