diff --git "a/data/chunks/2603.10721_semantic.json" "b/data/chunks/2603.10721_semantic.json" new file mode 100644--- /dev/null +++ "b/data/chunks/2603.10721_semantic.json" @@ -0,0 +1,782 @@ +[ + { + "chunk_id": "f9736496-e83f-4934-9a58-19483db24e7f", + "text": "Sample-and-Search: An Effective Algorithm for Learning-Augmented k-Median\nClustering in High dimensions Kangke Cheng1, Shihong Song2, Guanlin Mo1, Hu Ding1*\n1University of Science and Technology of China, Hefei, China\n2School of Informatics, University of Edinburgh, Edinburgh, UK\nke314159@mail.ustc.edu.cn, S.Song-29@sms.ed.ac.uk, moguanlin@mail.ustc.edu.cn, huding@ustc.edu.cn Abstract which provides greater robustness to outliers and heavytailed distributions. Thus k-median clustering is preferable\nIn this paper, we investigate the learning-augmented k- in many practical applications, especially when data is noisy.\nmedian clustering problem, which aims to improve the per-2026 Therefore, our work focuses on the k-median setting, aimformance of traditional clustering algorithms by preprocessing to retain its robustness advantages while addressing the ing the point set with a predictor of error rate α ∈[0, 1).\nalgorithmic challenges through the proposed framework. This preprocessing step assigns potential labels to the points\nbefore clustering. We introduce an algorithm for this prob- Learning-Augmented algorithms. A central challengeMar\nlem based on a simple yet effective sampling method, which in the field of algorithm design lies in simultaneously re-\n11 substantiallyalgorithms. Moreover,improves uponwe mitigatethe timetheircomplexitiesexponentialof existingdepen- ducingreliable algorithmicapproximationtimeratio.complexityThe proliferationwhile maintainingof large-a\ndency on the dimensionality of the Euclidean space. 
Lastly, scale data and the advancement of machine learning bring\nwe conduct experiments to compare our method with sev- the opportunity to obtain valuable prior knowledge for\neral state-of-the-art learning-augmented k-median clustering many classical algorithmic problems. To overcome the ofmethods.", + "paper_id": "2603.10721", + "title": "Sample-and-Search: An Effective Algorithm for Learning-Augmented k-Median Clustering in High dimensions", + "authors": [ + "Kangke Cheng", + "Shihong Song", + "Guanlin Mo", + "Hu Ding" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10721v1", + "chunk_index": 0, + "total_chunks": 39, + "char_count": 1842, + "word_count": 226, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "1909a4e7-0b32-4ff0-b406-6c8e37e87f67", + "text": "The experimental results suggest that our proposed\nten pessimistic bounds of traditional worst-case analysis,\napproach can significantly reduce the computational comthe theoretical computer science community has introduced plexity in practice, while achieving a lower clustering cost.[cs.DS] learning-augmented algorithms (Hsu et al. 2018; Antoniadis\net al. 2020; Dinitz et al. 2021; Lykouris and VassilvitCode — https://github.com/KangkeCheng/Learning- skii 2021; Mitzenmacher and Vassilvitskii 2022)—a new\nAugmented-k-Median-Sample-and-Search paradigm that falls under the umbrella of \"Beyond WorstCase Analysis\" (Roughgarden 2021). The core idea is to de-\n1 Introduction sign algorithms that can harness auxiliary information, typically from a machine-learned model, to enhance their perAs the core topics in unsupervised learning, k-median and formance.\nk-means clusterings are widely applied to numerous fields, For learning-augmented k-means and k-median clusterlike bioinformatics (Kiselev et al. 2017), computer vi- ing, Gamlath et al. (2022) explored noisy labels, achieving\nsion (Caron et al. 
2020), and social network (Ghaffari, (1+O(ϵ))-approximation for balanced adversarial noise and\nMosavi, and Shamshirband 2021). The primary goal of these O(1)-approximation for stochastic noise models. Here, the\ncenter-based clustering problems is to partition a set of unla- approximation ratio is a measure of the solution's quality,\nbeled data points into multiple clusters, such that data points defined as the ratio of the cost of the algorithm's solution\nwithin the same cluster are similar to each other (under some to the cost of the optimal solution. Ergun et al. (2022) demetric), while data points in different clusters exhibit signif- veloped a learning-augmented framework where data points\nicant dissimilarity.arXiv:2603.10721v1 are augmented with predicted labels, quantified by an error\nk-means problem seeks to find k centers that minimize rate α ∈[0, 1). They proposed a randomized algorithm for\nthe sum of squared Euclidean distances from each point k-means problem that achieves a (1 + 20α)-approximation\nto its nearest center. Formally, the goal is to minimize under some specific constraints on α and cluster size in\nPx∈X minc∈C ∥x −c∥22, where X is the input dataset and O(nd log n) time, where n denotes the number of points\nC is the set of k centers. Despite its popularity, k-means is to be clustered and d denotes the dimension of the space.\nknown to be sensitive to outliers and noise, as the squared They also proposed an algorithm for the k-median probdistance objective increases the impact of extreme values 1 lem that, under the condition that α = ˜O k , achieves an quadratically. In contrast, the k-median problem minimizes\n˜O((kα)1/4)-approximation. the sum of Euclidean distances: Px∈X minc∈C ∥x −c∥2,\nNguyen, Chaturvedi, and Nguyen (2023) further im-\n*Corresponding author proved these algorithms. 
Their k-means algorithm directly\nCopyright © 2026, Association for the Advancement of Artificial estimates locally optimal centers dimension-wise across preIntelligence (www.aaai.org). All rights reserved. dicted clusters, achieving a (1 + O(α))-approximation in Methods Approximation Ratio Label Error Range Time Complexity\nErgun et al. (2022) 1 + O((kα)1/4) ˜O( k)1 O(nd log3 n + poly(k, log n))\n7α+10α2−10α3 1 k\nNguyen, Chaturvedi, and Nguyen (2023) 1 + (1−α)(1−2α) [0, 1/2) O( 1−2αnd log3 n log2 δ )\n(6+ϵ)α−4α2 √ d\nHuang et al. (2025) 1 + (1−α)(1−2α) [0, 1/2) O(nd log(kd) log(n∆) · ( αϵ )O(d)) (6+ϵ)α−4α2 k Sample-and-Search (ours) 1 + (1−α)(1−2α) [0, 1/2) O(2O(1/(αε)4)ndlog δ ) Table 1: A comparison of our Sample-and-Search algorithm with state-of-the-art methods. Here, the terms ϵ > 0 and δ ∈(0, 1)\nare the parameters that control the approximation precision and success probability. ∆denotes the aspect ratio of the given\npoint set.", + "paper_id": "2603.10721", + "title": "Sample-and-Search: An Effective Algorithm for Learning-Augmented k-Median Clustering in High dimensions", + "authors": [ + "Kangke Cheng", + "Shihong Song", + "Guanlin Mo", + "Hu Ding" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10721v1", + "chunk_index": 1, + "total_chunks": 39, + "char_count": 3881, + "word_count": 577, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "511502dd-37a0-4e09-be83-7aaa236d3a92", + "text": "O(nd log n) time when α ∈[0, 1/2). Their k-median algorithm significantly improves the approximation guarantee\nby employing multiple random samplings and pruning techniques. More recently, Huang et al. (2025) extended the\ndimension-wise estimation method for learning-augmented\nk-means to reduce the time complexity while maintaining a similar approximation ratio by using sampling to\navoid sorting. 
They also proposed a k-median algorithm\n(6+ϵ)α−4α2\nthat achieves a 1 + (1−α)(1−2α)-approximation, which represents the state-of-the-art in terms of approximation ratio\nfor learning-augmented k-median clustering, as far as we Figure 1: Comparison of the Approximation Ratios for our\nare aware. However, in their work, the structural differences algorithm (set ϵ = 0.1) and the NCN algorithm in term of\nbetween k-means and k-median lead to a fundamental algo- the change of error rate α. This plot shows that our algorithm\nrithmic gap. For k-means, the mean center has a closed-form (green dashed line) consistently achieves a lower approxisolution that can be computed independently across dimen- mation ratio than the NCN algorithm (Nguyen, Chaturvedi,\nsions. In contrast, k-median centers lack closed-form ex- and Nguyen 2023) (blue solid line) across all values of the\npressions and cannot be decomposed dimension-wise, mak- error rate α ∈[0, 1/2).", + "paper_id": "2603.10721", + "title": "Sample-and-Search: An Effective Algorithm for Learning-Augmented k-Median Clustering in High dimensions", + "authors": [ + "Kangke Cheng", + "Shihong Song", + "Guanlin Mo", + "Hu Ding" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10721v1", + "chunk_index": 2, + "total_chunks": 39, + "char_count": 1350, + "word_count": 200, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "1a71368b-ba10-4696-895a-8ce7a21ebee0", + "text": "The purple shaded area highlights\ning them significantly harder to compute even when point this performance gap, which becomes more pronounced as\nlabel predictions are available. 
As a result, their k-median α increases.\nmethod needs brute-force grid partitioning and searching\nprocedure in the original high-dimentional space, thus introduces an exponential dependence on d, which is genera simple yet effective algorithm for learning-augmented k-ally considered unacceptable in practice, particularly in high\nmedian clustering. The time complexity is linear in n anddimensional scenarios. Hence, a key open problem is: Is it\nd, avoiding exponential dependence on the dimension d. Atpossible to design an algorithm that achieves the state-ofthe same time, our algorithm achieves an approximation ra-the-art approximation ratio while overcoming the exponen-\n(6+ϵ)α−4α2 1\ntial dependence on the dimension d? tio of 1 + (1−α)(1−2α) for α < 2, matching the state-of-theOur key ideas and main contributions. Furthermore, we conduct a set of experiments on highsight is that for each predicted cluster, the true median of dimensional datasets, demonstrating speedups (up to 10×)\nthe correctly labeled subset lies close to a low-dimensional over prior methods while maintaining relatively high clussubspace spanned by a small random sample. This allows tering quality.", + "paper_id": "2603.10721", + "title": "Sample-and-Search: An Effective Algorithm for Learning-Augmented k-Median Clustering in High dimensions", + "authors": [ + "Kangke Cheng", + "Shihong Song", + "Guanlin Mo", + "Hu Ding" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10721v1", + "chunk_index": 3, + "total_chunks": 39, + "char_count": 1361, + "word_count": 198, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "dc41e447-b549-42d1-b362-9925dfd84ca7", + "text": "Figure 1 compares the approximation ratio of\nus to efficiently discretize the search space using a low- their algorithm (denoted as NCN) with ours. Table 1 prodimensional grid, thus reducing the computational cost. 
The vides a detailed comparison for the results of our and existmain technical challenge is that the predicted clusters may ing methods.\nbe noisy—some points are misclassified, and the predicted\ncluster center may be far away from the true one. To tackle 1.1 Preliminaries\nthis obstacle, we design a novel sampling-and-search frame- Notations. Let X denote the input set of n points in Rd.\nwork that can effeictively select appropriate candidate clus- For any two points p, q ∈Rd, their Euclidean distance is\nter centers. The key idea is to utilize a greedy search strategy ∥p −q∥2.\nin the aforementioned low-dimensional grid, which neatly Given any point set C, the distance from a point p to its\navoids to explicitly distinguish between the correclty labeled closest point in C is denoted as dist(p, C) = minc∈C ∥p −\nand mislabeled points. c∥2. In particular, when C is the given set of centers for the\nOur contributions are summarized as follows: We propose point set X, the corresponding cost, denoted as Cost(X, C), Proposition 1.1 shows that the subspace spanned by a sufficient random sample from P is guaranteed to contain a\ngood approximation of the Med(P), which enables us to construct a candidate set of centers by partitioning this subspace\nwith a grid to approximate the optimal center. In Figure 2,\nwe depicts the generation of the subspace from the samples. Proposition 1.2 is used to estimate the average cost of the\nclusters, which in turn guides the design of the grid cell sidelength. Proposition 1.2. (Kumar, Sabharwal, and Sen 2010) Let P\nbe a point set in Rd. Given a parameter ζ ∈(0, 1/12), we\nFigure 2: (a) provides a simplified illustration of how a sub- randomly sample a point p0 and a set S of size 1/ζ from P.\nspace is generated. We sample a subset S (denoted by the Define the value v = Cost(S, p0). Then with the probability\nblue points) from the original point set P (denoted by the (1−ζ2)1/ζ+1\nblack points), and S forms a subspace span(S). 
(b) shows > 2 , we have:\nthat span(S) contains a projection of Med(P), denoted by\nProj(Med(P)), which is close to Med(P). Moreover, S con- vζ3 ⩽Cost(P, Med(P)) ⩽v .\ntains a point (e.g., s1) that is within a bounded distance from 2 |P| ζ\nMed(P)).", + "paper_id": "2603.10721", + "title": "Sample-and-Search: An Effective Algorithm for Learning-Augmented k-Median Clustering in High dimensions", + "authors": [ + "Kangke Cheng", + "Shihong Song", + "Guanlin Mo", + "Hu Ding" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10721v1", + "chunk_index": 4, + "total_chunks": 39, + "char_count": 2435, + "word_count": 422, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "42fec21e-4a9f-4af2-88dd-785d3e45e243", + "text": "Upper bound on error rate α. We consider the label\nerror rate α to be upper-bounded by 1/2. As discussed by\nNguyen, Chaturvedi, and Nguyen (2023), when α reaches\nis defined as Cost(X, C) = Px∈X dist(p, C). With a 1/2, the relationship between the predicted and optimal clusslight abuse of notation, we also use Cost(P, c) to denote ters can break down entirely.\nthe sum of distances from a point set P to a single point c. We denote the optimal k clusters for the given instance X 1.2 Other related work\nas {X∗1, . . . , X∗k}, and the set C∗= {c∗1, . . . , c∗k} contains k-median algorithms. Due to the NP-Hardness of the ktheir corresponding optimal centers. median problem (Cohen-Addad and C.S.Karthik 2019), its\nFor any point set P and we use Med(P) to denote its me- approximate algorithms have been extensively studied over\ndian point, i.e., the past half-century. Although several PTAS (PolynomialTime Approximation Scheme) algorithms have been proMed(P) = arg min X ∥p −q∥2. 
(1) q∈Rd posed, their running time is exponential in either the dip∈P mension d or the number of clusters k (Arora, Raghavan,\nand Rao 1998; Cohen-Addad, Feldmann, and Saulpic 2021;Therefore, for each 1 ≤i ≤k, c∗i = Med(X∗i ). Kumar, Sabharwal, and Sen 2010), making them impractiDefinition 1.1 (learning-augmented k-median cluster- cal for many settings. Other algorithms include the (3 + ϵ)-\ning). Suppose there exists a predictor that outputs a labeled approximation algorithm via local search proposed by Arya\npartition { ˜X1, ˜X2, . . . , ˜Xk} for X, parameterized by a label et al. (2001) and the 3.25-approximation using the LPerror rate α ∈[0, 1), which satisfies: rounding approach proposed by Charikar and Li (2012).\nk-means alogrithms. Similarly with the k-median prob-\n|˜Xi ∩X∗i | ⩾(1 −α) max(|˜Xi|, |X∗i |) lem, existing PTAS algorithms for the k-means problem exhibit an exponential dependence on either d or k (Cohenwhere | · | denotes the number of points in a set. The goal Addad, Feldmann, and Saulpic 2021; Kumar, Sabharwal,\nof learning-augmented k-median clustering is using such a and Sen 2010). Simultaneously, the widely used k-means++\npartially correct result to compute a center set C ⊂Rd that algorithm has a time complexity of O(ndk) and achieves\nminimizes Cost(P, C). an approximation ratio of O(log k) (Arthur and VassilvitWe also introduce two important propositions on geomet- skii 2007), or O(1) for well-separated data (Ostrovsky et al.\nric median point in Euclidean space, which are essential for 2013).\nour following proofs.\n2 Our Algorithm And Theoretical AnalysisProposition 1.1. (Badoiu, Har-Peled, and Indyk 2002) Let\nP be a point set in Rd. Given two parameters 1 > ε > 0 and In this section, we propose the \"sample-and-search\" al-\nγ > 1, we draw a random sample S from P of size ε3γ log 1ε. gorithm for learning-augmented k-median clustering. 
Our\nThen, with the probability at least 1−1/γ, the following two main idea is to extract information from predicted labels\nevents occur: (i) The flat span(S) contains a point within through uniform sampling, then leverage the properties of\nϵ·Cost(P,Med(P )) median point (based on Proposition 1.1) to construct a cana distance of |P | from Med(P), where span(S) didate center set in a low-dimensional subspace, and finally\ndenotes the subspace spanned by the points in S. (ii) The set\nCost(P,Med(P )) employ a greedy search approach to find the desired soluS contains a point within a distance of 2 × |P | tion from the candidate center set.", + "paper_id": "2603.10721", + "title": "Sample-and-Search: An Effective Algorithm for Learning-Augmented k-Median Clustering in High dimensions", + "authors": [ + "Kangke Cheng", + "Shihong Song", + "Guanlin Mo", + "Hu Ding" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10721v1", + "chunk_index": 5, + "total_chunks": 39, + "char_count": 3506, + "word_count": 588, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "1cbddf5b-9edd-4237-af29-f49e232b7a03", + "text": "This sample-and-search\nfrom Med(P). strategy avoids searching in the original space which may be much larger than the subspace derived from Proposition 1.1,\nand thereby is able to reduce the total computational com- Algorithm 1: SAMPLE-AND-SEARCH FOR LEARNINGplexity to a great extent. In Section 2.1, we introduce the AUGMENTED k-MEDIAN\ndetailed algorithm and our main theoretical result, i.e., The-\n1: Input: A k-median instance, consisting of a finite set of\norem 2.1. Then, we provide the proof for Theorem 2.1 in points X ⊂Rd and an integer k; a predicted partition\nSection 2.2.\n{ X1,˜ . . . , Xk}˜ of X with error rate α ∈[0, 1/2); an\naccuracy parameter ϵ ∈(0, 1) and failure probability2.1 Our Proposed Algorithm And Main Theorem\nδ ∈(0, 1). 
We present the Sample-and-Search algorithm in Algo- 2: Output:A set ˆC ∈Rd of centers with |ˆC| = krithm 1. In general, the algorithm consists of three main\nstages: 3: Initialize ˆC ←∅, ζ ← 131\n4: for i ←1 to k do\n1. Sampling-Based Subspace Construction: For each 5: Initialize a candidate set Ci ←∅\npredicted cluster, we sample a small subset of points l log(δ/k) m\n6: for j ←1 to log(0.975) do to form a \"basis\" that captures a \"neighbor\" within a\nbounded distance from the optimal cluster center. Note 7: Samplings: first, randomly sample a point yji ∈\nthat once we have found such a basis, we only need to Xi,˜ and then sample two separated sets from Xi˜\nsearch within the low-dimensional subspace spanned by uniformly at random: a set Qji ⊆ Xi˜ of size\nαϵ this small basis to approximate the optimal center of the 2 log(1/( 2 ))\nαϵ ⌉ ⌈ (1−α)ζ ⌉, and a set Rji ⊆˜Xi of size ⌈4(1−α)( cluster. This allows the size of our search space to depend 2 )3\nonly on ϵ, not on the dimension d. 8: for each subset Q ⊆Qji of size 1/ζ do\n2. Grid-based Candidate Generation: After generating k 9: v ←Cost(Q, yji )\nappropriate subspaces where each one of them is suffi- 10: a ←vζ32 and b ←vζ ciently close to the corresponding optimal centers, we\nconstruct a grid structure in each of the subspaces to gen- 11: for each integer l ∈{⌊log2 a⌋, . . . , ⌈log2 b⌉}\ndo erate k small candidate sets of center points. This elim-\n12: t ←2l inates the need to search the original high-dimensional\nspace. 13: Run Algorithm 2: S ←CSC(Rji, t, α, ϵ)\n14: Ci ←Ci ∪S\n3. Greedy Center Selection: We select the best center from 15: end for\nthe candidate set using a cost-minimization greedy selec- 16: end for\ntion procedure. 17: end for\nWe present the main theoretical result of our algorithm 18: For each c ∈Ci, define Ni(c) as the set of ⌈(1 −\nbelow. α)|Xi|⌉points˜ in Xi˜ closest to c\nTheorem 2.1. 
Let 1 > ϵ > 0 and 1 > δ > 0 be two 19: Greedy selection: find the best candidate for the i-th\nparameters. Given an instance X as described in Defini- cluster center, ˆci ←arg minc∈Ci Px∈Ni(c) ∥x −c∥2\ntion 1.1, if we assume the error rate α < 12, then Algo- 20: ˆC ←ˆC ∪{ˆci}\nrithm 1 can output a solution with the approximation ratio 21: end for\n6α−4α2+ϵα with probability 1 −δ. The time com- 22: return ˆC 1 + (1−α)(1−2α)\nplexity is O(2O(1/(αε)4)ndlog kδ ).", + "paper_id": "2603.10721", + "title": "Sample-and-Search: An Effective Algorithm for Learning-Augmented k-Median Clustering in High dimensions", + "authors": [ + "Kangke Cheng", + "Shihong Song", + "Guanlin Mo", + "Hu Ding" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10721v1", + "chunk_index": 6, + "total_chunks": 39, + "char_count": 3077, + "word_count": 569, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "2911b9d1-4191-4f8e-9f56-3be2a6df79f3", + "text": "2.2 Proof of Theorem 2.1\nAlgorithm 2: CANDIDATE SET CONSTRUCTION (CSC)We divide the proof into three main steps: First, we establish that for each predicted cluster, with high probability, 1: Input: A point set R ⊂Rd; an approximate average\nthe constructed candidate center set contains at least one cost t > 0; parameters α and ϵ.\npoint that is close to the true median of the correctly la- 2: Output: A set of candidate centers S ⊂Rd.\nbeled subset of the predicted cluster. This is formalized in 3: Initialize S ←∅, θ ← 4|R|αϵt\nLemma 2.2, which leverages the geometric properties from 4: for each point r ∈R do\nProposition 1.1 and Proposition 1.2 under our sampling de- 5: Initialize a candidate set Sr ←∅\nsign. Second, we analyze the cost of the selected center from 6: Construct a grid Gr on span(R) centered at r with\nthe candidate set. 
In Lemma 2.3, we show that this cen- side-length θ\nter yields a clustering cost close to the optimal one, despite 7: Define a ball B(r, 2t) ←{x ∈Rd | ∥x −r∥2 ≤2t}\nthe noisy labels, by carefully bounding the additional cost 8: Sr ←Gr ∩B(r, 2t)\nincurred by misclassified points and the optimality of the 9: S ←S ∪Sr\ngreedy choice. 10: end for\nFinally, we aggregate the bounds over all clusters to ob- 11: return S\ntain the total clustering cost, and analyze the size of the candidate set and runtime of our algorithm. For convenience, we denote the intersection of Xi˜ and X∗i where the second inequality comes from inequality (4).\nas Ti, i. e. , Ti = Xi˜ ∩X∗i . Combining inequality (5) and inequality (6), by triangle inequality, we have\nLemma 2.2. For predicted cluster Xi,˜ with a probability of\n1 −δk, there exists a point q ∈˜Ci satisfying: ||q −Med(Ti)||2 ⩽αϵ × Cost(Ti, Med(Ti)) .\n|Ti|\n||q −Med(Ti)||2 ⩽αϵ × Cost(Ti, Med(Ti)) . (2) Now, we calculate the probability that all events suc-\n|Ti| ceed in a single trial. The combined success probability\nProof. First, under the learning-augmented setting, we have (1−ζ2)1/ζ+1 (1−ζ2)1/ζ+1 e − 1−ζζ > is . We have 32 32 32 ≥0.025|Ti| ⩾(1 −α) max(|Xi|,˜ |X∗i |) > 2|1 Xi|.˜ As we uniformly ∈(0, when ζ 1/12). Here, the first inequality is a disample a point yi from Xi,˜ and uniformly sample a set Qji rect application of ln(1 −x) > − 1−xx and the second infrom Xi˜ with size (1−α)ζ2 in the first stage of our algo- equality is obtained by leveraging the fact that the function\nrithm, by employing Markov's inequality, we deduce that, is monotonically decreasing. Therefore, the probability of\nwith probability at least 12 × 12 = 4,1 the following two events failure for each trial is less than 0.975. Since we perform\nlog(δ/k) moccur simultaneously: l runs, the overall success probability is therefore log(0.975)\nlog(δ/k)\n⩾1 . greater than 1 −0.975 log(0.975) = 1 −δk. 
yi ∈Ti, |Ti ∩Qji| ζ\nWe now turn to evaluate the clustering cost incurred by\nNow assume both of these events occur.", + "paper_id": "2603.10721", + "title": "Sample-and-Search: An Effective Algorithm for Learning-Augmented k-Median Clustering in High dimensions", + "authors": [ + "Kangke Cheng", + "Shihong Song", + "Guanlin Mo", + "Hu Ding" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10721v1", + "chunk_index": 7, + "total_chunks": 39, + "char_count": 2875, + "word_count": 511, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "1648a1ff-7afe-4327-82c4-ca28e3a17346", + "text": "There exists a subset\n1 the selected centers. Lemma 2.3 plays a central role in this\nQ ⊆(Ti ∩Qji) of size ζ . According to Proposition 1.2, if analysis—it quantifies how far the selected center might be\nwe set p0 = yji , S = Q, P = Ti, for v = Cost(Q, yji ), we from the true cluster center due to noisy labels and sampling\nhave variance, and how this error translates into overall clustering\ncost.\n˜ we have: vζ3 ⩽Cost(Ti, Med(Ti)) ⩽v (3) Lemma 2.3. For each predicted cluster Xi, 2 |Ti| ζ\n6α −4α2 + αϵ (1−ζ2)1/ζ+1 Cost(X∗i , ˆci) ⩽ 1 + Cost(X∗i , c∗i ).with probability at least 2 . In the second stage (1 −α)(1 −2α)\nof our algorithm, we iterate over all subsets of Qji of\n1 Proof. A critical component of the analysis is to relate the\nsize ζ , therefore, there exists an integer l in the interval selected center ˆcito the true center c∗i and the median of cor- vζ3 v\n⌊log 2 , log ζ ⌉such that t = 2l satisfies rectly labeled points Med(Ti). We begin by splitting the cost\ninto two parts,\nt/2 ⩽Cost(Ti, Med(Ti)) ⩽t. (4) Cost(X∗i , ˆci) = Cost(Ti, ˆci) + Cost(X∗i \\Ti, ˆci) (7) |Ti| We focus on the term \"Cost(Ti, ˆci)\" first. 
To establish\nSimilarly, as we uniformly sample a set Rji from Xi˜ with the equality, we first compute the additional cost incurred αϵ 4 log(1/ ) 2\nαϵ in the first stage of our algorithm, by em- by assigning points in Ti to ˆci by decomposing the setssize (1−α)( )3 2\nαϵ ∩ into three disjoint partitions: Ti and Ni(ˆci) S1 = Ti 2 )\nαϵploying Markov's inequality, we have |Ti∩Rji| ⩾2 log(1/( \\ This im- Ni(ˆci), S2 = Ti Ni(ˆci), S3 = Ni(ˆci) \\ Ti. 2 )3\nwith probability at least 1/2. Thus, according to the Propo- plies Ti = S1 ∪S2 and Ni(ˆci) = S1 ∪S3. Then the\nsition 1.1, with probability at least 1/2 , the following two Cost(Ti, ˆci) −Cost(Ti, Med(Ti)) can be written as\nevents happen:\nCost(Ti, ˆci) −Cost(Ti, Med(Ti))\n1. The flat span(Rji) contains a point at a distance ⩽ = [Cost(S1, ˆci) + Cost(S2, ˆci)]\nαϵCost(Ti,Med(Ti))\nfrom Med(Ti), −[Cost(S1, Med(Ti)) + Cost(S2, Med(Ti))] 2|Ti|\n2. Rji contains a point at a distance ⩽2 × Cost(Ti,Med(Ti))|Ti| = [Cost(S1, ˆci) + Cost(S3, ˆci)]\nfrom the center Med(Ti). −[Cost(S1, Med(Ti)) + Cost(S3, Med(Ti))]\n−[Cost(S3, ˆci) −Cost(S3, Med(Ti))]So, the flat span(Rji) contains a point o such that\n+ [Cost(S2, ˆci) −Cost(S2, Med(Ti))]\n× Cost(Ti, Med(Ti)) ⩽αϵ . 
(5) = [Cost(Ni(ˆci), ˆci) −Cost(Ni(ˆci), Med(Ti))] ||o −Med(Ti)||2\n2|Ti| −[Cost(Ni(ˆci) \\ Ti, ˆci) −Cost(Ni(ˆci) \\ Ti, Med(Ti))]\nTherefore, under the construction of the grid in Algorithm 2, + [Cost(Ti \\ Ni(ˆci), ˆci) −Cost(Ti \\ Ni(ˆci), Med(Ti))]\nthere must exist a point q ∈S satisfying (8)\nWe also have |Ti \\ Ni(ˆci))| ⩽α|Xi|˜ and |Ni(ˆci)) \\ ⩽αεt ⩽αε × Cost(Ti, Med(Ti)) , (6) ||q −o||2\n4 2|Ti| Ti| ⩽α|Xi|.˜ So we can find an upper bound for Cost(Ti \\ Ni(ˆci), ˆci)−Cost(Ti\\Ni(ˆci), Med(Ti)) by triangle inequal- We now proceed to formally prove Theorem 2.1 by estabity as lishing both the approximation ratio and the runtime comCost(Ti \\ Ni(ˆci), ˆci) −Cost(Ti \\ Ni(ˆci), Med(Ti)) plexity.\n⩽|Ti \\ Ni| × ||Med(Ti) −ˆci||2 ⩽α|Xi|||Med(Ti)˜ −ˆci||2.", + "paper_id": "2603.10721", + "title": "Sample-and-Search: An Effective Algorithm for Learning-Augmented k-Median Clustering in High dimensions", + "authors": [ + "Kangke Cheng", + "Shihong Song", + "Guanlin Mo", + "Hu Ding" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10721v1", + "chunk_index": 8, + "total_chunks": 39, + "char_count": 3085, + "word_count": 551, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "0f381e1f-c746-48b8-a302-1ef119f8ba92", + "text": "Proof of Theorem 2.1. Our first step is to compute the ap-\n(9) proximation ratio of the algorithm. In each cluster, by\nWe can obtain inequality ∥Med(Ti) − ˆci||2 ⩽ Lemma 2.3, we obtain\n(2+αϵ)Cost(X∗ i ,c∗i )\n(1−2α)|Xi|˜˜ through triangle inequality (the de- Cost(X∗i , ˆci) ⩽ 1 + 6α −4α2 + ϵα Cost(X∗i , c∗i ). (1 −α)(1 −2α)tailed derivation is provided in the full version of the\ni ,c∗i ) . 
Therefore, for the entire instance, we havepaper.).", + "paper_id": "2603.10721", + "title": "Sample-and-Search: An Effective Algorithm for Learning-Augmented k-Median Clustering in High dimensions", + "authors": [ + "Kangke Cheng", + "Shihong Song", + "Guanlin Mo", + "Hu Ding" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10721v1", + "chunk_index": 9, + "total_chunks": 39, + "char_count": 443, + "word_count": 81, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "7e134ff8-3192-490b-8206-e8a97cf0134e", + "text": "Then we have inequality (9) ⩽(2α+α2ϵ)Cost(X∗1−2α\nSimilarly, we obtain\nX Cost(X∗i , {bcj}kj=1) Cost(Ni(ˆci) \\ Ti, ˆci) −Cost(Ni(ˆci) \\ Ti, Med(Ti)) i∈[k]\ni , c∗i ) ⩽(2α + α2ϵ)Cost(X∗ . (10) ⩽ 1 + 6α −4α2 + ϵα X Cost(X∗i , c∗i ). 1 −2α (1 −α)(1 −2α)\nNow we find a upper bound for Cost(Ni(ˆci), ˆci) − i∈[k]\nCost(Ni(ˆci), Med(Ti)). Because of our greedy selec- We now assess the time complexity of the algorithm. This\ntion, we have Cost(Ni(ˆci), ˆci) ⩽ Cost(Ni(q), q) ⩽ involves analyzing the size of the candidate center set genCost(Ni(ˆci), q). So erated via sampling and grid discretization, and the cost inCost(Ni(ˆci), ˆci) −Cost(Ni(ˆci), Med(Ti)) curred in evaluating all candidate centers.\n⩽Cost(Ni(ˆci), q) −Cost(Ni(ˆci), Med(Ti)) First, we compute the size of set of candidate centers. The\nsize of the candidate center set we ultimately construct is\n⩽|(1 −α)|Xi||˜ × ||q −Med(Ti)||2. (11)\nBy applying Lemma 2.2, we can obtain inequality (11) ⩽ 1 1 O(|R|) k !\nO |R| log log\nαϵ × Cost(Ti, Med(Ti)). Putting inequality (9), inequality (αϵ)4 (αϵ) δ\n(10) and inequality (11) together, we obtain the following\nO( 1 ) kbound for the left side of equation (7) O( (αϵ)31 log2 (αϵ)1 ) k (αϵ)4 = O 2 log ⩽O 2 log . Cost(Ti, ˆci) −Cost(Ti, c∗i ) δ δ\n⩽Cost(Ti, ˆci) −Cost(Ti, Med(Ti)) For each candidate point within the candidate center set, the\n⩽(4α + ϵα)Cost(X∗i , c∗i ) . 
(12) timeoverallneededtime complexityto calculate ofitsthecostalgorithmis |Xi|d.˜ isConsequently, the 1 −2α\nNext, we consider the second term \"Cost(X∗i \\Ti, ˆci)\" in\nO( (αϵ)4 (αϵ)4equation (7). By triangle inequality X 2 O( 1 )nd log k . 1 )|Xi|d˜ log k = 2\nCost(X∗i \\Ti, ˆci) ⩽Cost(X∗i /Ti, c∗i ) i∈[k] δ δ\n+ |X∗i \\Ti| × ||ˆci −c∗i ||2, (13) Next, we analyze the success probability of the algorithm. Subsequently, we bound |X∗i \\Ti|. It follows from Defini- As we obtained in Lemma 2.2, the success probability in\ntion 1.1 that (1 −α)|X∗i | ⩽|Ti|, |Ti| ⩽|Xi|,˜ so, we can each cluster of the algorithm is 1−δk, therefore, by the union\nbound |X∗i \\Ti| as bound, the overall success probability of the algorithm ⩾\nXi|˜ 1 −k × kδ = 1 −δ. ⩽α| (14) |X∗i \\Ti| = |X∗i | −|Ti| ⩽α|X∗i | 1 −α.\n3 Experiment\nCombining (13), (14), and the inequality ∥ˆci −c∗i ||2 ⩽\n(2+αϵ)Cost(X∗i ,c∗i ) We evaluated our algorithms on real-world datasets. The we obtain in the full version of the paper, (1−2α)|Xi|˜ experiments were conducted on a server with an Intel(R)\nwe have Xeon(R) Gold 6154 CPU and 1024GB of RAM. For all exCost(X∗i \\Ti, ˆci) −Cost(X∗i \\Ti, c∗i ) periments, we report the average clustering cost and its standard deviation over 10 independent runs.\n⩽α(2 + αϵ)Cost(X∗i , c∗i ) . (15) Datasets. Following the work of Nguyen, Chaturvedi,\n(1 −α)(1 −2α) and Nguyen (2023), Ergun et al. (2022) and Huang et al. Now, we derive the final approximation guarantee. Com- (2025), we evaluate our algorithms on the CIFAR-10 (n =\nbining inequality (12) and inequality (15), we have 50, 000, d = 3, 072) (Krizhevsky and Hinton 2009), PHY\nCost(X∗i , ˆci) = Cost(Ti, ˆci) + Cost(X∗i /Ti, ˆci) (n = 10, 000, d = 50) (KDD 2004), and MNIST (n =\n1, 797, d = 64) (Deng 2012) datasets using a range of error\n6α −4α2 + ϵα\n⩽ 1 + Cost(X∗i , c∗i ). 
rates α.", + "paper_id": "2603.10721", + "title": "Sample-and-Search: An Effective Algorithm for Learning-Augmented k-Median Clustering in High dimensions", + "authors": [ + "Kangke Cheng", + "Shihong Song", + "Guanlin Mo", + "Hu Ding" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10721v1", + "chunk_index": 10, + "total_chunks": 39, + "char_count": 3197, + "word_count": 574, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "6690329e-7633-49eb-a4eb-831ac55ca30e", + "text": "We additionally evaluated our algorithm's performance on another high-dimensional dataset, Fashion-MNIST (n = 60000, d = 784) (Xiao, Rasul, and Vollgraf 2017). significant increase in time complexity? Furthermore, could a learning-augmented clustering algorithm be designed for the streaming model to more effectively handle large-scale data? Predictor Generation and Error Simulation. To evaluate our algorithms, we first computed a ground-truth partition for each dataset using Lloyd's algorithm initialized with KMedoids++ (denoted as KMed++). We then generated corrupted partitions with the error rate α by randomly selecting an α fraction of points in each true cluster and reassigning them to a randomly chosen cluster (denoted as Predictor). To ensure a fair comparison, every algorithm was tested on the exact same set of corrupted labels for any given error rate α. Table 2 rows (Condition α, Method; Cost Avg., Dev.; Time(s) Avg., Dev.): 0: KMed++ 8.4054e+07; Predictor 8.4259e+07. 0.1: EFS+ 8.4050e+07, 115.17, 270.47, 12.97; HFH+ 8.4049e+07, 834.92, 749.72, 18.47; NCN 8.4050e+07, 181.80, 272.22, 4.37; Ours 8.4048e+07, 933.64, 47.37, 0.78. Algorithms. In our experiments, we evaluate our proposed Sample-and-Search algorithm. We compare its performance against other state-of-the-art learning-augmented methods, including the algorithm from Ergun et al.
(2022) (denoted as EFS+), Nguyen, Chaturvedi, and Nguyen (2023) (denoted as NCN), and the recent work by Huang et al. (2025) (denoted as HFH+). As noted by Nguyen, Chaturvedi, and Nguyen (2023), the true error rate α is generally unknown in practice, which necessitates a search for its optimal value. To ensure a fair comparison, we implement a uniform hyperparameter tuning strategy for all evaluated algorithms. Specifically, we iterate over 10 candidate values for α, which are chosen from uniformly spaced points in the interval [0.01, 0.5]. For each method, the α that minimizes the resulting k-median clustering cost is chosen to produce the final output. To assess the final clustering quality against the ground-truth labels, we additionally report the Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI) in the full version of the paper. Results. We present a comparative evaluation of our algorithm against several baselines in Table 2 and Table 3. Table 2 details the performance on the Fashion-MNIST (n = 60000, d = 784) for a fixed k = 10 across a range of α values. Table 2 rows continued: 0.2: Predictor 8.4935e+07; EFS+ 8.4057e+07, 287.83, 283.13, 24.07; HFH+ 8.4053e+07, 1598.91, 751.66, 25.42; NCN 8.4057e+07, 309.52, 282.97, 13.78; Ours 8.4052e+07, 961.03, 47.96, 3.33. 0.3: Predictor 8.6223e+07; EFS+ 8.4076e+07, 467.75, 282.57, 8.66; HFH+ 8.4065e+07, 3527.00, 751.13, 22.67; NCN 8.4077e+07, 695.49, 299.50, 22.54; Ours 8.4062e+07, 3848.20, 45.38, 1.33. 0.4: Predictor 8.8209e+07; EFS+ 8.4109e+07, 631.67, 297.27, 13.66; HFH+ 8.4101e+07, 11512.25, 758.16, 29.89; NCN 8.4111e+07, 1206.20, 302.45, 25.71; Ours 8.4100e+07, 12684.42, 45.29, 2.18. 0.5: Predictor 9.0897e+07; EFS+ 8.4150e+07, 1320.47, 304.67, 11.82; HFH+ 8.4148e+07, 12671.01, 751.08, 26.53; NCN 8.4152e+07, 2503.70, 305.95, 10.98; Ours 8.4145e+07, 17385.62, 47.87, 2.04. Table 2: Performance comparison on Fashion-MNIST
dataset with k = 10 and varied α. Table 3 shows the results on the PHY dataset (n = 10, 000, d = 50) with a fixed α = 0.2 for various choices of k.", + "paper_id": "2603.10721", + "title": "Sample-and-Search: An Effective Algorithm for Learning-Augmented k-Median Clustering in High dimensions", + "authors": [ + "Kangke Cheng", + "Shihong Song", + "Guanlin Mo", + "Hu Ding" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10721v1", + "chunk_index": 11, + "total_chunks": 39, + "char_count": 3331, + "word_count": 526, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "b2b17bea-3372-4d0a-a3cc-93f58b5006eb", + "text": "Both sets of results demonstrate that our algorithm is substantially faster than all competing methods while generally achieving better approximation quality. Additional experiments on other datasets and a more detailed presentation of the results are available in the supplementary material. On these datasets, our algorithm also demonstrates significant advantages in terms of both running time and cost. Table 3 rows (Condition k, Method; Cost Avg., Dev.; Time(s) Avg., Dev.): 10: KMed++ 2.0224e+08; Predictor 2.0427e+08; EFS+ 2.0204e+08, 4444.80, 362.31, 52.30; HFH+ 2.0147e+08, 105661.96, 42.33, 1.91; NCN 2.0163e+08, 109131.95, 160.15, 14.33; Ours 2.0134e+08, 82812.18, 20.72, 0.59. 30: KMed++ 8.4404e+07; Predictor 8.5018e+07; EFS+ 8.4490e+07, 721.59, 294.21, 54.13; HFH+ 8.4404e+07, 4372.98, 42.26, 2.05; NCN 8.4480e+07, 14266.80, 221.81, 49.30; Ours 8.4404e+07, 3043.38, 27.08, 3.26. 4 Conclusion and Future work. In this paper, we study the learning-augmented k-median clustering problem. We first introduce an algorithm for this problem based on a simple yet effective sampling method, then study its quality guarantees in theory, and finally conduct a set of experiments to compare with other learning-augmented k-median algorithms.
Both theoretical and experimental results demonstrate that our method achieves the state-of-the-art approximation ratio with higher efficiency than existing methods. Following this work, there are several opportunities to further improve our methods from both theoretical and practical perspectives. For example, is it possible to further reduce the time complexity of the algorithm by mitigating or eliminating the exponential dependence on ϵ? Can the approximation ratio be further improved without a Table 3 rows continued: KMed++ 6.2758e+07; Predictor 6.3111e+07; EFS+ 6.2796e+07, 503.13, 285.51, 22.48; HFH+ 6.2755e+07, 1072.71, 44.69, 1.34; NCN 6.2791e+07, 5662.71, 208.89, 29.26; Ours 6.275456e+07, 677.03, 36.87, 0.73. Table 3: Performance comparison on PHY dataset with fixed α = 0.2 and varied k. 5 Acknowledgments. This work was supported in part by the NSFC through grants No. 62432016 and No. 62272432, the National Key R&D program of China through grant 2021YFA1000900, and the Provincial NSF of Anhui through grant 2208085MF163. The authors would also like to thank the anonymous reviewers for their valuable comments and suggestions. Deng, L. 2012. The MNIST Database of Handwritten Digit Images for Machine Learning Research [Best of the Web]. IEEE Signal Processing Magazine, 29(6): 141–142. Dinitz, M.; Im, S.; Lavastida, T.; Moseley, B.; and Vassilvitskii, S. 2021. Faster matchings via learned duals. In Proceedings of the 35th International Conference on Neural Information Processing Systems, NIPS '21.
Red Hook, NY, USA: Curran Associates Inc.", + "paper_id": "2603.10721", + "title": "Sample-and-Search: An Effective Algorithm for Learning-Augmented k-Median Clustering in High dimensions", + "authors": [ + "Kangke Cheng", + "Shihong Song", + "Guanlin Mo", + "Hu Ding" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10721v1", + "chunk_index": 12, + "total_chunks": 39, + "char_count": 2747, + "word_count": 413, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "9d272404-8e51-4b4e-93cd-9bb1cd9ef282", + "text": "C.; Feng, Z.; Silwal, S.; Woodruff, D.; and Zhou, S. 2022. Learning-Augmented $k$-means Clustering. In International Conference on Learning Representations. Antoniadis, A.; Gouleakis, T.; Kleer, P.; and Kolev, P. 2020. Secretary and online matching problems with machine learned advice. In Proceedings of the 34th International Conference on Neural Information Processing Sys- Gamlath, B.; Lattanzi, S.; Norouzi-Fard, A.; and Svensson, O. 2022. Approximate Cluster Recovery from Noisy Labels.", + "paper_id": "2603.10721", + "title": "Sample-and-Search: An Effective Algorithm for Learning-Augmented k-Median Clustering in High dimensions", + "authors": [ + "Kangke Cheng", + "Shihong Song", + "Guanlin Mo", + "Hu Ding" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10721v1", + "chunk_index": 13, + "total_chunks": 39, + "char_count": 490, + "word_count": 65, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "d01134ef-e2bc-4151-9cf6-0462bcae84ef", + "text": "tems, NIPS '20. Red Hook, NY, USA: Curran Associates Inc. ISBN 9781713829546. In Loh, P.-L.; and Raginsky, M., eds., Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 of Proceedings of Machine Learning Research, 1463–1509. Arora, S.; Raghavan, P.; and Rao, S.
1998.", + "paper_id": "2603.10721", + "title": "Sample-and-Search: An Effective Algorithm for Learning-Augmented k-Median Clustering in High dimensions", + "authors": [ + "Kangke Cheng", + "Shihong Song", + "Guanlin Mo", + "Hu Ding" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10721v1", + "chunk_index": 14, + "total_chunks": 39, + "char_count": 283, + "word_count": 42, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "e8c6ece3-7abb-4f99-8b23-2f44f9a5231e", + "text": "PMLR. Approximation schemes for Euclidean k-medians and related problems. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC '98, 106–113. New York, NY, USA: Association for Computing Machinery. ISBN 0897919629. Ghaffari, M.; Mosavi, A.; and Shamshirband, S. 2021. Clustering and high-dimensional representation of social network users' behavior for bot detection. In Companion Proceedings of the Web Conference 2021, 19–22. Hsu, C.-Y.; Indyk, P.; Katabi, D.; and Vakilian, A. 2018. Learning-Based Frequency Estimation Algorithms. In International Conference on Learning Representations. Arthur, D.; and Vassilvitskii, S. 2007. k-means++: the advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '07, 1027–1035.
USA: Society for Industrial and Applied Mathematics. Huang, J.; Feng, Q.; Huang, Z.; Zhang, Z.; Xu, J.; and Wang,", + "paper_id": "2603.10721", + "title": "Sample-and-Search: An Effective Algorithm for Learning-Augmented k-Median Clustering in High dimensions", + "authors": [ + "Kangke Cheng", + "Shihong Song", + "Guanlin Mo", + "Hu Ding" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10721v1", + "chunk_index": 15, + "total_chunks": 39, + "char_count": 926, + "word_count": 128, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "9cc1b44f-f1cf-4676-967a-415591a880a2", + "text": "New Algorithms for the Learning-Augmented k-means Problem. In The Thirteenth International Conference on Learning Representations. Arya, V.; Garg, N.; Khandekar, R.; Meyerson, A.; Munagala, K.; and Pandit, V. 2001. Local search heuristic for k-median and facility location problems.", + "paper_id": "2603.10721", + "title": "Sample-and-Search: An Effective Algorithm for Learning-Augmented k-Median Clustering in High dimensions", + "authors": [ + "Kangke Cheng", + "Shihong Song", + "Guanlin Mo", + "Hu Ding" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10721v1", + "chunk_index": 16, + "total_chunks": 39, + "char_count": 282, + "word_count": 39, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "7a473cae-ac7d-4d3f-8f70-1e040589025d", + "text": "In Proceedings of the Thirty-Third Annual ACM Symposium on Theory of Computing, STOC '01, 21–29. New York, NY, USA: Association for Computing Machinery. ISBN 1581133499. Inaba, M.; Katoh, N.; and Imai, H. 1994. Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering (extended abstract). In Proceedings of the Tenth Annual Symposium on Computational Geometry, SCG '94, 332–339. New York, NY, USA: Association for Badoiu, M.; Har-Peled, S.; and Indyk, P. 2002.
Computing Machinery. ISBN 0897916484. Approximate clustering via core-sets. In Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing, STOC '02, 250–257. New York, NY, USA: Association for Computing Machinery. KDD. 2004. KDD Cup. https://osmot.cs.cornell.edu/kddcup/index.html.", + "paper_id": "2603.10721", + "title": "Sample-and-Search: An Effective Algorithm for Learning-Augmented k-Median Clustering in High dimensions", + "authors": [ + "Kangke Cheng", + "Shihong Song", + "Guanlin Mo", + "Hu Ding" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10721v1", + "chunk_index": 17, + "total_chunks": 39, + "char_count": 794, + "word_count": 108, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "6d785e71-5f05-4ca1-849d-d7b74d361461", + "text": "Y.; Kirschner, K.; Schaub, M. T.; Andrews, T.; Yiu, A.; Chandra, T.; Natarajan, K. N.; Reik, W.; Barahona, M.; Green, A. R.; and Hemberg, M. 2017. SC3: consensus clustering of single-cell RNA-seq data. Nature methods, 14(5): 483–486. Caron, M.; Misra, I.; Mairal, J.; Goyal, P.; Bojanowski, P.; and Joulin, A. 2020. Unsupervised learning of visual features by swapping assignments between views. In Advances in Neural Information Processing Systems, volume 33, 9912–9924. Krizhevsky, A.; and Hinton, G. 2009. Learning multiple layers of features from tiny images. Technical Report 0, University of Toronto, Toronto, Ontario. Charikar, M.; and Li, S. 2012. A dependent LP-rounding approach for the k-median problem. In Proceedings of the 39th International Colloquium Conference on Au- Kumar, A.; Sabharwal, Y.; and Sen, S.
tomata, Languages, and Programming - Volume Part I, ICALP'12, 194–205. Linear-time approximation schemes for clustering problems in any di", + "paper_id": "2603.10721", + "title": "Sample-and-Search: An Effective Algorithm for Learning-Augmented k-Median Clustering in High dimensions", + "authors": [ + "Kangke Cheng", + "Shihong Song", + "Guanlin Mo", + "Hu Ding" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10721v1", + "chunk_index": 18, + "total_chunks": 39, + "char_count": 963, + "word_count": 139, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "81149f55-e93b-473c-9bee-667e6f3f0eb9", + "text": "mensions. Berlin, Heidelberg: Springer-Verlag. Lykouris, T.; and Vassilvitskii, S. 2021. Competitive Caching with Machine Learned Advice. ACM, 68(4). Cohen-Addad, V.; and C.S. Karthik. 2019. Inapproximability of Clustering in Lp Metrics.", + "paper_id": "2603.10721", + "title": "Sample-and-Search: An Effective Algorithm for Learning-Augmented k-Median Clustering in High dimensions", + "authors": [ + "Kangke Cheng", + "Shihong Song", + "Guanlin Mo", + "Hu Ding" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10721v1", + "chunk_index": 19, + "total_chunks": 39, + "char_count": 238, + "word_count": 30, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "94e8f2f4-b94a-411b-b59e-3a0f94558707", + "text": "In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), 519–539. Mitzenmacher, M.; and Vassilvitskii, S. 2022. Algorithms with predictions. ACM, 65(7): 33–35. D.; Chaturvedi, A.; and Nguyen, H. 2023. Improved Learning-augmented Algorithms for k-means and k-medians Clustering. In The Eleventh International Conference on Learning Representations. Cohen-Addad, V.; Feldmann, A. E.; and Saulpic, D. 2021. Near-linear Time Approximation Schemes for Clustering in Doubling Metrics. ACM, 68(6).
Ostrovsky, R.; Rabani, Y.; Schulman, L. J.; and Swamy, C. 2013. which directly implies", + "paper_id": "2603.10721", + "title": "Sample-and-Search: An Effective Algorithm for Learning-Augmented k-Median Clustering in High dimensions", + "authors": [ + "Kangke Cheng", + "Shihong Song", + "Guanlin Mo", + "Hu Ding" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10721v1", + "chunk_index": 20, + "total_chunks": 39, + "char_count": 597, + "word_count": 82, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "c1037b9a-65dc-4ffb-987c-840843d30959", + "text": "The effectiveness of Lloyd-type methods for the k-means problem. ACM, 59(6). Roughgarden, T. 2021. Beyond the Worst-Case Analysis of Algorithms. Cambridge University Press. ||ˆci − Med(Ti)||2 ⩽ (2 + αϵ)Cost(X∗i, c∗i)/((1 − 2α)|X̃i|). (23) Here we have completed the proof of the first inequality.", + "paper_id": "2603.10721", + "title": "Sample-and-Search: An Effective Algorithm for Learning-Augmented k-Median Clustering in High dimensions", + "authors": [ + "Kangke Cheng", + "Shihong Song", + "Guanlin Mo", + "Hu Ding" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10721v1", + "chunk_index": 21, + "total_chunks": 39, + "char_count": 283, + "word_count": 45, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "b944ee19-5c80-4f18-bac2-3972e20aed41", + "text": "The proof of the second inequality is similar; like inequality (18), we have Σp∈Ni(ˆci)∩X∗i (||p − ˆci||2 + ||p − Med(X∗i)||2) ⩾ |Ni(ˆci) ∩ X∗i| × ||ˆci − Med(X∗i)||2 ⩾ (1 − 2α)|X̃i| × ||ˆci − Med(X∗i)||2, (24) Xiao, H.; Rasul, K.; and Vollgraf, R. 2017. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. ArXiv, abs/1708.07747. A missing proof for k-median. Claim A.1.
The two distances ||ˆci − Med(Ti)||2 and ||ˆci − c∗i||2 are both no larger than (2 + αϵ)Cost(X∗i, c∗i)/((1 − 2α)|X̃i|). The second inequality in (24) comes from Ti ⊆ X∗i, so |Ni(ˆci) ∩ X∗i| ⩾ |Ni(ˆci) ∩ Ti|. First, due to our greedy search approach, we can easily establish the inequality relationship between Cost(Ni(ˆci), ˆci) and Cost(Ti, Med(Ti)): we have Cost(Ni(ˆci), ˆci) ⩽ Cost(Ni(q), q). As Ni(q) is the set of the nearest points from ˆci of size (1 − α)|X̃i|, we also have Cost(Ni(q), q) ⩽ Cost(Ti, q). We have Cost(Ti, q) ⩽ (1 + αϵ)Cost(Ti, Med(Ti)). From the inequalities presented above, it follows that Cost(Ni(ˆci), ˆci) ⩽ (1 + αϵ)Cost(Ti, Med(Ti)). (16) According to the definition of learning-augmented, |Ti| ⩾ (1 − α)ni, and because |Ni(ˆci)| = (1 − α)ni and Ni(ˆci), Ti ⊆ X̃i, we can derive |Ni(ˆci) ∩ Ti| ⩾ |Ni(ˆci)| − |X̃i \\ Ti| ⩾ (1 − α − α)|X̃i| = (1 − 2α)|X̃i|. (17) Then, according to the triangle inequality and inequality (17), it follows that Σp∈Ni(ˆci)∩Ti (||p − ˆci||2 + ||p − Med(Ti)||2) (18) ⩾ |Ni(ˆci) ∩ Ti| × ||ˆci − Med(Ti)||2 ⩾ (1 − 2α)|X̃i| × ||ˆci − Med(Ti)||2. (19) Also by inequality (16), we have Σp∈Ni(ˆci)∩Ti ||p − ˆci||2 ⩽ Cost(Ni(ˆci), ˆci) ⩽ (1 + αϵ)Cost(X∗i, c∗i) (20) Because Σp∈Ni(ˆci)∩X∗i ||p − ˆci||2 ⩽ Cost(Ni(ˆci), ˆci) ⩽ (1 + αϵ)Cost(X∗i, c∗i) and Σp∈Ni(ˆci)∩X∗i ||p − Med(X∗i)||2 ⩽ Cost(X∗i, c∗i), we have ||ˆci − c∗i||2 ⩽ (2 + αϵ)Cost(X∗i, c∗i)/((1 − 2α)|X̃i|). This is the second inequality we aimed to prove. B Algorithm for k-means. Notations. For k-means, given any point set C, the distance from a point p to its closest point in C is denoted as dist2(p, C). In particular, when C is the given set of centers for the point set X, the corresponding cost, denoted as Cost2(X, C), is defined as Cost2(X, C) = Σx∈X dist2(x, C). For any point set P, we use Cen(P) to denote its means point, i.e., Cen(P) = arg minq∈Rd Σp∈P ∥p − q∥22. In this section, we extend the Sample-and-Search algorithm to the Learning-augmented k-means problem. Our approach still proceeds in three stages, but differs from Algorithm 1 in two aspects: we modify the number of sampled points and the method for building the candidate center set. Specifically, we first sample a constant-size set of data points, then construct candidate center sets in time exponential in the sample size, and finally identify locally optimal centers in time linear in the dataset size. B.1 Our Proposed Algorithm And Main Theorem. The detailed implementation of the algorithm is described in Algorithm 3. Table 4 provides a detailed comparison of results for Learning-Augmented k-means algorithms.", + "paper_id": "2603.10721", + "title": "Sample-and-Search: An Effective Algorithm for Learning-Augmented k-Median Clustering in High dimensions", + "authors": [ + "Kangke Cheng", + "Shihong Song", + "Guanlin Mo", + "Hu Ding" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10721v1", + "chunk_index": 22, + "total_chunks": 39, + "char_count": 2959, + "word_count": 491, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "e403de5f-ef83-42f3-8399-0ac2621e8319", + "text": "We present the main theoretical result of our algorithm below.
Σp∈Ni(ˆci)∩Ti ||p − Med(Ti)||2 ⩽ Cost(X∗i, c∗i). (21) Theorem B.1.", + "paper_id": "2603.10721", + "title": "Sample-and-Search: An Effective Algorithm for Learning-Augmented k-Median Clustering in High dimensions", + "authors": [ + "Kangke Cheng", + "Shihong Song", + "Guanlin Mo", + "Hu Ding" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10721v1", + "chunk_index": 23, + "total_chunks": 39, + "char_count": 116, + "word_count": 20, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "efc4ad73-88af-4657-9ce5-1734982d852f", + "text": "Algorithm 3 is an algorithm for Learning-Augmented k-means clustering. Given a dataset X ∈ Rn×d and a partition (X̃1, . . . , X̃k) with an error rate of α < 1/2, the algorithm outputs a solution with an approximation ratio of 1 + α/(1 − α) + (4α + αϵ)/((1 − 2α)(1 − α)) and completes with constant probability within O(2^{O(1/ε)} nd log k) time complexity. Combining inequalities (18), (20), and (21), we obtain (2 + αϵ)Cost(X∗i, c∗i) ⩾ (1 − 2α)|X̃i| ||ˆci − Med(Ti)||2. (22) Table 4: Comparison results of learning-augmented k-means algorithms. Columns: Methods and References; Approximation Ratio; Label Error Range; Time Complexity. Ergun et al. (2022): 1 + 20α; [10 log m/√m, 1/7]; O(nd log n). Nguyen, Chaturvedi, and Nguyen (2023): 1 + α/(1 − α) + 4α/((1 − 2α)(1 − α)); [0, 1/2); O(nd log n). Huang et al. (2025): 1 + α/(1 − α) + (4α + αϵ)/((1 − 2α)(1 − α)); [0, 1/2); O(ϵ^{−1/2} nd log(kd)). Huang et al. (2025): 1 + α/(1 − α) + (12α − 18α2)/((1 − 3α − ϵ)(1 − 2α − ϵ)); (0, 1/3 − ϵ); O(nd) + Õ(ϵ^{−5} kd). Sample-and-Search (ours): 1 + α/(1 − α) + (4α + αϵ)/((1 − 2α)(1 − α)); [0, 1/2); O(2^{O(1/ϵ)} nd log k). B.2 Proof of Theorem B.1. We first introduce two well-known and widely used results in the field of k-means clustering. Let X ⊆ Rd be a set of n points, and c ∈ Rd. Cost2(X, c) = Cost2(X, Cen2(X)) + n · ∥c − Cen2(X)∥22.
For any arbitrary partition X1 ∪ X2 of a set X ⊆ Rd, where X has size n, if |X1| ≥ (1 − λ)n, then: ∥Cen2(X) − Cen2(X1)∥22 ≤ (λ/((1 − λ)n)) Cost2(X, Cen2(X)). We also introduce an important proposition on the geometric means point in Euclidean space, which is essential for our following proofs. Proposition B.4. (Inaba, Katoh, and Imai 1994) Let S be a set of m points obtained by independently sampling m points uniformly at random from a point set P. Then, for any δ > 0, ∥Cen2(S) − Cen2(P)∥22 ⩽ (1/(δm)) · Cost2(P, Cen2(P))/|P| holds with probability at least 1 − δ. Proposition B.4 shows that if a sufficient number of points are sampled randomly from P, then the centroid of the sampled points is close to Cen2(P) with high probability, which enables us to construct a candidate set of centers by directly using the centroid of the sampled points. Algorithm 3: Sample-and-Search for Learning-Augmented k-means. 1: Input: A k-means instance (X, k, d), a set (X̃1 . . . X̃k) of partitions with error rate α, and a parameter ϵ ∈ (0, 1). 2: Output: A set C ∈ Rd of centers with |C| = k. 3: Ĉ ← {}. 4: for i ∈ [k] do 5: Ci ← {} 6: for j = 1 to ⌈log(δ/k)/log 0.75⌉ do 7: Randomly and independently sample a set Rji from X̃i with size ⌈4/((1 − α)ϵ)⌉ 8: for every ⌈1/((1 − α)ϵ)⌉-subset R of Rji do 9: Ci = Ci ∪ Cen2(R) 10: end for 11: end for 12: For each c ∈ Ci, define Ni(c) as the set of ⌈(1 − α)mi⌉ points in X̃i closest to c. 13: ˆci = arg minc∈Ci Cost2(Ni(c), c) 14: Ĉ = Ĉ ∪ Cen(Ni(ˆci)) 15: end for 16: Return Ĉ. We divide the proof into three main steps: First, we establish that for each predicted cluster, with high probability, the constructed candidate center set contains at least one point that is close to the true center of the correctly labeled subset of the predicted cluster.
This is formalized in Lemma B.5,\nwhich leverages the geometric properties from Proposition\nB.4 under our sampling design.", + "paper_id": "2603.10721", + "title": "Sample-and-Search: An Effective Algorithm for Learning-Augmented k-Median Clustering in High dimensions", + "authors": [ + "Kangke Cheng", + "Shihong Song", + "Guanlin Mo", + "Hu Ding" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10721v1", + "chunk_index": 24, + "total_chunks": 39, + "char_count": 3090, + "word_count": 553, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "41d9ee1f-105b-48c7-a5ee-eb56c42f3a19", + "text": "Second, we analyze the cost\nof the selected center from the candidate set. In Lemma B.7,\nwe show that this center yields a clustering cost close to the\noptimal one, de- spite the noisy labels, by carefully bounding\nthe additional cost incurred by misclassified points and the\noptimality of the greedy choice. Finally, we aggregate the\nbounds over all clusters to obtain the total clustering cost,\nand analyze the size of the candidate set and runtime of our\nalgorithm. For each pridicted cluster Xi,˜ with probability So, we have\n1 −δk, there exists a point q ∈C′ satisfying:\nCost2(Ni(ˆci), ˆci)\n⩽ϵCost2(Ti, c(Ti)) . −α)ni ||q −Cen2(Ti)||2 |Ti| ⩽(1 + ϵ)(1 Cost2(Ti, Cen2(Ti))), (28) |Ti| First, under the learning-augmented setting, we have\nAccording to Proposition B.3, if we set C = Ni(ˆci), C1 =\n|Ti| ⩾(1 −α) max(|Xi|,˜ |Xi|).˜ Ni(ˆci) ∩Ti, as we know\nThen, as we sampled a point set Qi of size (1−α)ϵ4 from Xi,˜ |C1| ⩾|Ni(ˆci)| −|Xi\\Ti| ⩾1 −2α = 1 − αby employing Markov's inequality, we deduce that, C |Ni(ˆci)| 1 −α 1 −α, then it follows that ⩾2 . |Ti ∩Rji|\nϵ α\nCost2(Ni(ˆci), ˆci) 1 ⩽ 1−αwith α |Ni(ˆci)| probability at least 2. It follows that there exist a subset ||ˆci −Cen2(Ni(ˆci) ∩Ti)||22 1 − 1−αR satisfies\n2 αCost2(Ni(ˆci), ˆci) = . (29) . 
R ⊂Ti, |R| = (1 −2α)(1 −α)ni ϵ\nBy Proposition B.4, let P = Ti, S = R, δ = 2,1 m = 2ϵ , Similarly, if we set C = Ti, C1 = Ni(ˆci) ∩Ti, we can also\nwe have bound the distance between Cen2(Ni(ˆci)∩Ti) and Cen2(Ti)\n⩽ϵCost2(Ti, c(Ti)) . as ||q −Cen2(Ti)||2\n|Ti| ⩽αCost(Ti, Cen2(Ti) , ||Cen2(Ti) −Cen2(Ni(ˆci) ∩Ti)||22\n1 (1 −2α)|Ti|weth probability at least 2. Now, we calculate the probabil- (30)ity that all events succeed in a single trial. The combined\nsuccess probability is 14, Since we performl ⌈log(δ/k)log 0.75 ⌉runs, Based on inequalities (28), (29) and (30), we are able to\nthe overall success probability is therefore greater than bound the distance between ˆci and Cen2(Ti) log(δ/k)\n1 −0.75 log 0.75 = 1 −δ . ||ˆci −Cen2(Ti)||22 ⩽2||ˆci −Cen2(Ni(ˆci) ∩Ti)||22\n+ 2||Cen2(Ni(ˆci) ∩Ti) −Cen2(Ti)||22\nαCost2(Ni(ˆci), ˆci) ⩽2 We now turn to evaluate the clustering cost incurred by (1 −2α)(1 −α)ni\nthe selected centers. Lemma B.7 plays a central role in this\nCen2(Ti)analysis, it quantifies how far the selected center might be + 2αCost2(Ti,\nfrom the true cluster center due to noisy labels and sampling (1 −2α)|Ti|\nvariance, and how this error translates into overall clustering (4α + 2ϵα)Cost2(Ti, Cen2(Ti))\n= .cost. We first establish the following claim. (1 −2α)|Ti| The distance between ˆci and Cen2(Ti) satisfies\nthe following inequality:\n||ˆci −Cen2(Ti)||22 ≤4α + ϵα Cost2(Ti, Cen2(Ti)) . 
We now turn to evaluate the clustering cost incurred by 1 −2α |Ti| the selected centers.", + "paper_id": "2603.10721", + "title": "Sample-and-Search: An Effective Algorithm for Learning-Augmented k-Median Clustering in High dimensions", + "authors": [ + "Kangke Cheng", + "Shihong Song", + "Guanlin Mo", + "Hu Ding" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10721v1", + "chunk_index": 25, + "total_chunks": 39, + "char_count": 2741, + "word_count": 466, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "1362a09a-3795-47e7-a033-8f3877322b00", + "text": "Lemma B.7 plays a central role in this\nanalysis—it quantifies how far the selected center might be\nProof.", + "paper_id": "2603.10721", + "title": "Sample-and-Search: An Effective Algorithm for Learning-Augmented k-Median Clustering in High dimensions", + "authors": [ + "Kangke Cheng", + "Shihong Song", + "Guanlin Mo", + "Hu Ding" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10721v1", + "chunk_index": 26, + "total_chunks": 39, + "char_count": 105, + "word_count": 18, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "b6ec0a77-84e3-40bc-b58f-3145dd826917", + "text": "First, due to our greedy search process, we have from the true cluster center due to noisy labels and sampling\nCost2(Ni(ˆci), ˆci) ⩽Cost2(Ni(q), q). (25) variance, and how this error translates into overall clustering\ncost. As Ni(q) is the set of the nearest points from ˆci of size (1 −\nα)|Xi|, we also have Lemma B.7. For pridicted cluster Xi, we have:\nCost2(Ni(q), q) ⩽Cost2(Ti, q). (26) Cost(X∗i , ˆci)\nLemma B.5 directly yields ⩽ 1 + α + 4α + αϵ Cost(X∗i , c∗i ). 1 −α (1 −2α)(1 −α) Cost2(Ti, q) ⩽(1 + ϵ)Cost2(Ti, Cen2(Ti)). (27)\nSince |Ti| ⩾(1 −α)ni, from the inequalities (25), (26) and Proof. 
We begin by dividing the calculation of\n(27) presented above, it follows that Cost2(X∗i , c∗i ) into two parts\nCost2(Ni(ˆci), ˆci) Cen2(Ti)) Cost2(X∗i , c∗i ) = Cost2(X∗i \\Ti, c∗i ) + Cost2(Ti, c∗i ). ⩽(1 + ϵ)Cost2(Ti, .\n(1 −α)ni |Ti| (31) According to Proposition B.2, Cost2(Ti, c∗i ) can be written We now proceed to formally prove Theorem 2.1 by estabas lishing both the approximation ratio and the runtime complexity. Cost2(Ti, Cen2(Ti))\nProof of Theorem B.1. Our first step is to compute the ap- i \\Ti| + (1 −|X∗ )|X∗i | × ||c∗i −Cen2(Ti)||22. (32) proximation ratio of the algorithm. In each cluster, by\n|X∗i | Lemma B.7, we obtain\nSimilarly, we have α 4α + αϵ\nCost2(X∗i , ˆci) ⩽ 1 + + Cost2(X∗i , c∗i ).Cost2(X∗i /Ti, c∗i ) = Cost2(X∗i \\Ti, Cen2(X∗i \\Ti)) 1 −α (1 −2α)(1 −α)\n|X∗i \\Ti| (35) + |X∗i | × ||c∗i −Cen2(X∗i \\Ti)||22\n|X∗i | Therefore, for the instance, we have\n(33)\nX Cost2(X∗i , {bcj}kj=1)Combining inequalities 31, 32, and 33 We obtain\ni∈[k]\nCost2(X∗i , c∗i ) = Cost2(X∗i \\Ti, Cen2(X∗i \\Ti)) α 4α + αϵ\n⩽ 1 + + X Cost2(X∗i , c∗i ). 1 −α (1 −2α)(1 −α) |X∗i \\Ti| i∈[k] + |X∗i |||c∗i −Cen2(X∗i /Ti)||22\n|X∗i | (36)\n+ Cost2(Ti, Cen2(Ti))\nNext, we analyze the time complexity of the algorithm. First,\ni \\Ti| + (1 −|X∗ )|X∗i |||c∗i −Cen2(Ti)||22 we compute the size of set of candidate centers. The1 total\nε(1−α) ) ⩽ |X∗i | count of subsets of S with a fixed size is O(\n1/ϵ\ni \\Ti| 1 −|X∗|X∗ i | 2O(1/ϵ), The time required to construct the candidate set is = |X∗i |||c∗i −Cen2(Ti)||22 |X∗i \\Ti| For each candidate point within the candidate\n|X∗i | center set, the time needed to calculate its cost is nid. Con-\n+ Cost2(X∗i /Ti, Cen2(X∗i /Ti)) sequently, the overall time complexity of the algorithm is\n+ Cost2(Ti, Cen2(Ti)) k k\nX 2O(1/ε)|Xi|d˜ log = 2O(1/ε)nd log . (37) −α δ δ ⩾1 |X∗i |||c∗i −Cen2(Ti)||22 i∈[k] α\n+ Cost2(Ti, Cen2(Ti)) Then, we analyze the success probability of the algorithm. 
By Lemma B.5 the success probability within a single clus- ⩾1 −α |X∗i |||c∗i −Cen2(Ti)||22 ter is By the union bound, the overall success proba- 1 −kδ . α\n1 −2α bility of the algorithm ⩾1 −k × kδ = 1 −δ. + · (1 −α)|X∗˜i |||ˆci −Cen2(Ti)||22, 4α + 2ϵα\n(34)", + "paper_id": "2603.10721", + "title": "Sample-and-Search: An Effective Algorithm for Learning-Augmented k-Median Clustering in High dimensions", + "authors": [ + "Kangke Cheng", + "Shihong Song", + "Guanlin Mo", + "Hu Ding" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10721v1", + "chunk_index": 27, + "total_chunks": 39, + "char_count": 2762, + "word_count": 514, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "7f7c840e-e9a1-42a3-a7b9-881a591d6d47", + "text": "Finally, by the Cauchy-Schwarz inequality, we obtain\n(||c∗i −Cen2(Ti)||2 + ||ˆci −Cen2(Ti)||2)2\nα 4α + 2ϵα\n≤ +\n1 −α (1 −2α)(1 −α)\n−α × (1 |X∗i | × ||c∗i −Cen2(Ti)||22\n1 −2α\n+ · (1 −α)|X∗i | × ||ˆci −Cen2(Ti)||22)/|X∗i |\n4α + 2ϵα\nα 4α + 2ϵα\n≤( + i , c∗i )/|X∗i |. 1 −α (1 −2α)(1 −α))Cost2(X∗ This directly yields\n||ˆci −Cen2(X∗i ||22\nα 4α + 2ϵα\n⩽Cost2(X∗i , c∗i ) + /|X∗i |. 1 −α (1 −2α)(1 −α) By Proposition B.2, this is equivalent to what we aim to\nprove.", + "paper_id": "2603.10721", + "title": "Sample-and-Search: An Effective Algorithm for Learning-Augmented k-Median Clustering in High dimensions", + "authors": [ + "Kangke Cheng", + "Shihong Song", + "Guanlin Mo", + "Hu Ding" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10721v1", + "chunk_index": 28, + "total_chunks": 39, + "char_count": 456, + "word_count": 95, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "3d2e1840-1402-4238-825d-0eb4f31cd3bf", + "text": "C Additional experiment for\nLearning-Augment k-median\nTables 5-8 show the experimental results on datasets CIFAR-\n10, Fashion-Mnist, PHY and Mnist, for varying α with fixed\nk. 
Tables 9-11 show our results for varying k with fixed α. Both sets of results demonstrate that our algorithm is substantially faster than all competing methods while generally achieving better approximation quality, particularly on high-dimensional datasets. D Experiments for Learning-Augmented k-means\nWe evaluated our algorithms on real-world datasets. The experiments were conducted on a server with an Intel(R) Xeon(R) Gold 6154 CPU and 1024GB of RAM.",
+ "paper_id": "2603.10721",
+ "title": "Sample-and-Search: An Effective Algorithm for Learning-Augmented k-Median Clustering in High dimensions",
+ "authors": [
+ "Kangke Cheng",
+ "Shihong Song",
+ "Guanlin Mo",
+ "Hu Ding"
+ ],
+ "published_date": "2026-03-11",
+ "primary_category": "",
+ "arxiv_url": "http://arxiv.org/abs/2603.10721v1",
+ "chunk_index": 29,
+ "total_chunks": 39,
+ "char_count": 628,
+ "word_count": 92,
+ "chunking_strategy": "semantic"
+ },
+ {
+ "chunk_id": "db5ef5d4-1fbe-44f9-ac7b-a46cd7e7b5a8",
+ "text": "For all experiments, we report the average clustering cost and its standard deviation over 10 independent runs. Following the work of Nguyen, Chaturvedi, and Nguyen (2023), Ergun et al. (2022) and Huang et al. (2025), we evaluate our algorithms on the CIFAR-10 (n = 50,000, d = 3,072) (Krizhevsky and Hinton 2009), PHY (n = 10,000, d = 50) (KDD 2004), and MNIST (n = 1,797, d = 64) (Deng 2012) datasets using a range of error rates α. We additionally evaluated our algorithm's performance on the Fashion-MNIST (n = 60,000, d = 784) (Xiao, Rasul, and Vollgraf 2017) dataset to assess its efficacy in high-dimensional settings.\nPredictor Generation and Error Simulation To evaluate our algorithms, we first computed a ground-truth partition for each dataset using Lloyd's algorithm initialized with KMeans++ (denoted as KMe++). 
We then generated corrupted partitions with error rate α by randomly selecting an α fraction of points in each true cluster and reassigning them to a randomly chosen cluster (denoted as Predictor). To ensure a fair comparison, every algorithm was tested on the exact same set of corrupted labels for any given error rate.\nAlgorithms In our experiments, we evaluate our proposed Sample-and-Search algorithm. We compare its performance against other state-of-the-art learning-augmented methods, including the algorithms from Ergun et al. (2022) (denoted as Erg), Nguyen, Chaturvedi, and Nguyen (2023) (denoted as Ngu), and the recent work by Huang et al. (2025) (denoted as Fast-Sampling, Fast-Estimation and Fast-Filtering).",
+ "paper_id": "2603.10721",
+ "title": "Sample-and-Search: An Effective Algorithm for Learning-Augmented k-Median Clustering in High dimensions",
+ "authors": [
+ "Kangke Cheng",
+ "Shihong Song",
+ "Guanlin Mo",
+ "Hu Ding"
+ ],
+ "published_date": "2026-03-11",
+ "primary_category": "",
+ "arxiv_url": "http://arxiv.org/abs/2603.10721v1",
+ "chunk_index": 30,
+ "total_chunks": 39,
+ "char_count": 1550,
+ "word_count": 239,
+ "chunking_strategy": "semantic"
+ },
+ {
+ "chunk_id": "372a39cf-e1e8-41ec-b8c1-48ef2e6329e7",
+ "text": "We utilize the cost calculated from the labels generated by our predictor as the baseline. The cost computed from the undegraded labels is considered the optimal cost. Our primary comparison focuses on the clustering cost and algorithm runtime on the given dataset. Furthermore, we also computed the standard deviation to assess the stability of the algorithms.\nResults. Tables 12-15 show the experimental results on datasets CIFAR-10, Fashion-MNIST, PHY and MNIST, for varying α with fixed k. Tables 16-18 show our results for varying k with fixed α. 
Both sets of results demonstrate that our algorithm is slightly slower than the Fast-Filtering method while generally achieving better approximation quality, particularly on high-dimensional datasets. Table 5: Performance comparison on CIFAR-10 dataset with fixed k = 10 and varied α. Condition Cost Time(s) NMI ARI 0 K-Med++ 2.7492e+07 - - - - - - - Predictor 2.7528e+07 - - - - - - -\nEFSplus 2.7406e+07 1109.14 29.87 14.76 0.6631 0.0012 0.5918 0.0026\n0.1 HFHplus 2.7401e+07 3929.91 184.22 1.45 0.6562 0.0078 0.5862 0.0107\nNCN 2.7404e+07 3466.13 34.43 7.95 0.6588 0.0050 0.5862 0.0065\nOurs 2.7397e+07 5089.44 14.16 0.26 0.6538 0.0090 0.5815 0.0131 Predictor 2.7626e+07 - - - - - - -\nEFSplus 2.7418e+07 1881.31 33.79 15.96 0.6571 0.0025 0.5737 0.0043\n0.2 HFHplus 2.7403e+07 6017.94 184.37 1.69 0.6548 0.0082 0.5796 0.0113\nNCN 2.7415e+07 2612.95 32.81 6.77 0.6506 0.0079 0.5671 0.0099\nOurs 2.7400e+07 4507.52 14.63 0.26 0.6560 0.0050 0.5800 0.0078 Predictor 2.7906e+07 - - - - - - -\nEFSplus 2.7460e+07 2413.62 35.60 19.24 0.6519 0.0046 0.5574 0.0061\n0.3 HFHplus 2.7413e+07 7979.19 185.80 1.97 0.6498 0.0065 0.5678 0.0093\nNCN 2.7454e+07 9237.83 32.87 8.06 0.6513 0.0051 0.5633 0.0072\nOurs 2.7409e+07 4717.90 14.27 0.36 0.6546 0.0060 0.5728 0.0105 Predictor 2.8316e+07 - - - - - - -\nEFSplus 2.7540e+07 3959.35 31.26 20.27 0.6302 0.0046 0.5344 0.0064\n0.4 HFHplus 2.7439e+07 27686.32 186.90 1.83 0.6375 0.0101 0.5444 0.0163\nNCN 2.7505e+07 18855.20 35.30 5.75 0.6286 0.0106 0.5306 0.0154\nOurs 2.7425e+07 10487.16 14.22 0.43 0.6447 0.0082 0.5556 0.0135 Predictor 2.8889e+07 - - - - - - -\nEFSplus 2.7584e+07 4907.95 25.72 18.49 0.6179 0.0039 0.5162 0.0048\n0.5 HFHplus 2.7488e+07 43919.20 185.27 1.69 0.6300 0.0139 0.5273 0.0186\nNCN 2.7569e+07 27179.60 39.10 11.63 0.6152 0.0135 0.5130 0.0196\nOurs 2.7481e+07 26429.82 13.84 0.35 0.6341 0.0071 0.5311 0.0141",
+ "paper_id": "2603.10721",
+ "title": "Sample-and-Search: An Effective Algorithm for Learning-Augmented k-Median Clustering in High 
dimensions", + "authors": [ + "Kangke Cheng", + "Shihong Song", + "Guanlin Mo", + "Hu Ding" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10721v1", + "chunk_index": 31, + "total_chunks": 39, + "char_count": 2389, + "word_count": 372, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "36a770ea-3ba9-4d73-a44e-e5ec5d5c7101", + "text": "Condition Cost Time(s) NMI ARI 0 K-Med++ 8.4054e+07 - - - - - - - Predictor 8.4259e+07 - - - - - - -\nEFSplus 8.4050e+07 115.17 270.47 12.97 0.9558 0.0003 0.9642 0.0003\n0.1 HFHplus 8.4049e+07 834.92 749.72 18.47 0.9528 0.0036 0.9591 0.0050\nNCN 8.4050e+07 181.80 272.22 4.37 0.9552 0.0008 0.9636 0.0007\nOurs 8.4048e+07 933.64 47.37 0.78 0.9516 0.0019 0.9581 0.0022 Predictor 8.4935e+07 - - - - - - -\nEFSplus 8.4057e+07 287.83 283.13 24.07 0.9490 0.0006 0.9573 0.0006\n0.2 HFHplus 8.4053e+07 1598.91 751.66 25.42 0.9433 0.0065 0.9493 0.0097\nNCN 8.4057e+07 309.52 282.97 13.78 0.9486 0.0006 0.9568 0.0006\nOurs 8.4052e+07 961.03 47.96 3.33 0.9453 0.0043 0.9515 0.0066 Predictor 8.6223e+07 - - - - - - -\nEFSplus 8.4076e+07 467.75 282.57 8.66 0.9419 0.0005 0.9486 0.0006\n0.3 HFHplus 8.4065e+07 3527.00 751.13 22.67 0.9333 0.0058 0.9387 0.0088\nNCN 8.4077e+07 695.49 299.50 22.54 0.9414 0.0008 0.9481 0.0010\nOurs 8.4062e+07 3848.20 45.38 1.33 0.9362 0.0067 0.9420 0.0105 Predictor 8.8209e+07 - - - - - - -\nEFSplus 8.4109e+07 631.67 297.27 13.66 0.9348 0.0005 0.9380 0.0007\n0.4 HFHplus 8.4101e+07 11512.25 758.16 29.89 0.9331 0.0022 0.9391 0.0021\nNCN 8.4111e+07 1206.20 302.45 25.71 0.9330 0.0050 0.9359 0.0064\nOurs 8.4100e+07 12684.42 45.29 2.18 0.9320 0.0041 0.9374 0.0029 Predictor 9.0897e+07 - - - - - - -\nEFSplus 8.4150e+07 1320.47 304.67 11.82 0.9185 0.0007 0.9191 0.0009\n0.5 HFHplus 8.4148e+07 12671.01 751.08 26.53 0.9176 0.0010 0.9195 0.0019\nNCN 8.4152e+07 2503.70 305.95 10.98 0.9178 0.0010 0.9183 0.0014\nOurs 8.4145e+07 17385.62 47.87 2.04 0.9201 0.0048 0.9224 0.0066 Table 7: 
Performance comparison on PHY dataset with fixed k = 10 and varied α. Condition Cost Time(s) NMI ARI 0 K-Med++ 1.9797e+08 - - - - - - - Predictor 1.9849e+08 - - - - - - -\nEFSplus 1.9800e+08 478.11 392.68 99.57 0.9882 0.0001 0.9900 0.0001\n0.1 HFHplus 1.9799e+08 6778.35 39.54 6.32 0.9896 0.0017 0.9913 0.0017\nNCN 1.9800e+08 4363.70 376.60 102.88 0.9883 0.0005 0.9900 0.0005\nOurs 1.9798e+08 7427.30 34.05 2.86 0.9867 0.0049 0.9884 0.0053 Predictor 2.0034e+08 - - - - - - -\nEFSplus 1.9813e+08 610.96 360.84 114.79 0.9789 0.0002 0.9799 0.0002\n0.2 HFHplus 1.9797e+08 1449.20 35.91 7.18 0.9934 0.0010 0.9950 0.0009\nNCN 1.9807e+08 26763.49 262.61 94.40 0.9816 0.0039 0.9824 0.0049\nOurs 1.9797e+08 522.96 32.72 3.64 0.9940 0.0005 0.9954 0.0004 Predictor 2.0446e+08 - - - - - - -\nEFSplus 1.9844e+08 2913.39 357.00 93.67 0.9710 0.0002 0.9690 0.0002\n0.3 HFHplus 1.9798e+08 7618.53 37.66 5.30 0.9889 0.0032 0.9903 0.0036\nNCN 1.9832e+08 72445.21 234.39 82.27 0.9739 0.0027 0.9724 0.0031\nOurs 1.9798e+08 3555.61 27.97 1.55 0.9919 0.0022 0.9931 0.0025 Predictor 2.1242e+08 - - - - - - -\nEFSplus 1.9896e+08 4654.51 304.39 97.99 0.9614 0.0002 0.9558 0.0002\n0.4 HFHplus 1.9799e+08 7119.24 35.93 5.46 0.9885 0.0029 0.9903 0.0031\nNCN 1.9874e+08 146423.75 225.87 80.83 0.9633 0.0027 0.9588 0.0036\nOurs 1.9798e+08 5888.86 28.25 1.95 0.9890 0.0031 0.9908 0.0034 Predictor 2.2831e+08 - - - - - - -\nEFSplus 2.0004e+08 8404.95 306.93 83.25 0.9500 0.0002 0.9380 0.0002\n0.5 HFHplus 1.9799e+08 15837.90 35.58 2.82 0.9879 0.0029 0.9896 0.0026\nNCN 1.9976e+08 399881.55 262.76 91.23 0.9525 0.0042 0.9422 0.0062\nOurs 1.9798e+08 3841.72 26.50 0.81 0.9893 0.0014 0.9908 0.0012", + "paper_id": "2603.10721", + "title": "Sample-and-Search: An Effective Algorithm for Learning-Augmented k-Median Clustering in High dimensions", + "authors": [ + "Kangke Cheng", + "Shihong Song", + "Guanlin Mo", + "Hu Ding" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": 
"http://arxiv.org/abs/2603.10721v1", + "chunk_index": 33, + "total_chunks": 39, + "char_count": 3218, + "word_count": 505, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "881ff229-9978-46ee-82d9-981558588f52", + "text": "Table 8: Performance comparison on MNIST dataset with fixed k = 10 and varied α. Condition Cost Time(s) NMI ARI 0 K-Med++ 44696.31 - - - - - - - Predictor 44795.92 - - - - - - -\nEFSplus 44695.47 0.78 25.23 6.42 0.9721 0.0030 0.9722 0.0038\n0.1 HFHplus 44695.66 0.40 3.58 1.09 0.9682 0.0069 0.9688 0.0071\nNCN 44695.55 0.47 24.97 6.58 0.9739 0.0053 0.9746 0.0056\nOurs 44695.31 0.71 3.36 0.91 0.9688 0.0072 0.9692 0.0072 Predictor 45104.16 - - - - - - -\nEFSplus 44698.38 0.57 26.28 7.15 0.9695 0.0056 0.9718 0.0053\n0.2 HFHplus 44697.47 0.45 3.56 1.29 0.9705 0.0031 0.9737 0.0036\nNCN 44698.24 0.18 27.18 7.91 0.9690 0.0014 0.9718 0.0017\nOurs 44697.18 0.30 3.40 0.87 0.9727 0.0021 0.9761 0.0024 Predictor 45654.02 - - - - - - -\nEFSplus 44706.18 1.20 25.54 7.29 0.9639 0.0059 0.9662 0.0058\n0.3 HFHplus 44700.07 1.48 3.47 1.09 0.9618 0.0059 0.9633 0.0057\nNCN 44706.96 1.15 25.83 8.28 0.9618 0.0036 0.9639 0.0036\nOurs 44699.50 1.24 3.05 0.55 0.9614 0.0033 0.9633 0.0031 Predictor 46620.32 - - - - - - -\nEFSplus 44704.53 1.16 27.35 6.00 0.9681 0.0036 0.9687 0.0048\n0.4 HFHplus 44704.70 1.73 3.82 1.16 0.9560 0.0086 0.9554 0.0106\nNCN 44706.67 1.27 27.65 7.45 0.9656 0.0049 0.9673 0.0058\nOurs 44701.89 1.63 3.39 1.03 0.9606 0.0066 0.9608 0.0078 Predictor 48022.97 - - - - - - -\nEFSplus 44717.07 3.38 28.34 7.28 0.9538 0.0037 0.9537 0.0042\n0.5 HFHplus 44718.66 2.15 3.63 1.16 0.9451 0.0065 0.9446 0.0080\nNCN 44720.35 1.46 26.36 7.25 0.9501 0.0031 0.9502 0.0039\nOurs 44716.27 5.17 3.40 0.96 0.9446 0.0081 0.9439 0.0102 Table 9: Performance comparison on Fashion-MNIST dataset with fixed α = 0.2 and varied k.", + "paper_id": "2603.10721", + "title": "Sample-and-Search: An Effective Algorithm for Learning-Augmented k-Median Clustering in High dimensions", + 
"authors": [ + "Kangke Cheng", + "Shihong Song", + "Guanlin Mo", + "Hu Ding" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10721v1", + "chunk_index": 34, + "total_chunks": 39, + "char_count": 1594, + "word_count": 275, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "6160bc5e-ed3e-4fbd-b8ce-1acbdfdb94b6", + "text": "Condition Cost Time(s) NMI ARI K-Med++ 8.4058e+07 - - - - - - -\nPredictor 8.4938e+07 - - - - - - -\nEFSplus 8.4059e+07 305.75 281.82 11.34 0.9460 0.0005 0.9547 0.0006\n10 HFHplus 8.4053e+07 1129.16 733.73 24.48 0.9419 0.0035 0.9489 0.0057\nNCN 8.4059e+07 811.26 285.90 2.63 0.9455 0.0015 0.9541 0.0016\nOurs 8.4052e+07 1275.81 60.78 3.40 0.9473 0.0025 0.9557 0.0030 K-Med++ 7.6802e+07 - - - - - - -\nPredictor 7.7606e+07 - - - - - - -\nEFSplus 7.6802e+07 294.01 244.01 2.30 0.9466 0.0006 0.9348 0.0008\n20 HFHplus 7.6792e+07 1249.97 945.64 3.66 0.9507 0.0028 0.9402 0.0048\nNCN 7.6802e+07 234.77 268.17 21.22 0.9458 0.0005 0.9335 0.0008\nOurs 7.6791e+07 951.50 78.03 0.57 0.9510 0.0022 0.9400 0.0035 K-Med++ 7.3922e+07 - - - - - - -\nPredictor 7.4732e+07 - - - - - - -\nEFSplus 7.3935e+07 339.73 238.19 2.97 0.9476 0.0006 0.9308 0.0010\n30 HFHplus 7.3915e+07 1969.80 1185.44 16.58 0.9516 0.0034 0.9359 0.0066\nNCN 7.3935e+07 1675.89 230.32 44.44 0.9446 0.0043 0.9258 0.0070\nOurs 7.3914e+07 388.43 100.95 1.95 0.9526 0.0011 0.9371 0.0018 K-Med++ 7.1839e+07 - - - - - - -\nPredictor 7.2598e+07 - - - - - - -\nEFSplus 7.1853e+07 621.20 237.52 3.37 0.9430 0.0005 0.9212 0.0006\n40 HFHplus 7.1829e+07 902.75 1451.52 18.02 0.9507 0.0020 0.9309 0.0039\nNCN 7.1853e+07 299.48 245.99 3.36 0.9425 0.0003 0.9203 0.0004\nOurs 7.1828e+07 905.20 119.70 0.57 0.9522 0.0016 0.9322 0.0035 K-Med++ 7.0470e+07 - - - - - - -\nPredictor 7.1242e+07 - - - - - - -\nEFSplus 7.0494e+07 200.76 232.55 2.20 0.9465 0.0012 0.9207 0.0025\n50 HFHplus 7.0466e+07 654.43 1759.73 20.54 0.9544 0.0009 0.9337 0.0013\nNCN 7.0495e+07 530.36 257.05 
26.50 0.9457 0.0002 0.9196 0.0004\nOurs 7.0465e+07 820.99 140.33 1.37 0.9574 0.0012 0.9377 0.0017 Table 10: Performance comparison on PHY dataset with fixed α = 0.2 and varied k. Condition Cost Time(s) NMI ARI K-Med++ 2.0224e+08 - - - - - - -\nPredictor 2.0428e+08 - - - - - - -\nEFSplus 2.0205e+08 4444.80 362.31 52.30 0.9369 0.0002 0.9192 0.0003\nHFHplus 2.0147e+08 105661.96 42.33 1.91 0.9289 0.0139 0.9036 0.0240\nNCN 2.0163e+08 109131.95 160.15 14.33 0.9173 0.0097 0.8866 0.0150\nOurs 2.0135e+08 82812.18 20.72 0.59 0.8934 0.0222 0.8446 0.0360 K-Med++ 1.0910e+08 - - - - - - -\nPredictor 1.1004e+08 - - - - - - -\nEFSplus 1.0927e+08 1096.15 335.96 46.37 0.9590 0.0002 0.9416 0.0004\nHFHplus 1.0910e+08 1141.88 40.96 2.95 0.9892 0.0095 0.9873 0.0148\nNCN 1.0923e+08 23983.20 225.00 47.90 0.9630 0.0011 0.9473 0.0016\nOurs 1.0910e+08 4120.98 25.49 0.90 0.9807 0.0075 0.9737 0.0117 K-Med++ 8.4404e+07 - - - - - - -\nPredictor 8.5019e+07 - - - - - - -\nEFSplus 8.4491e+07 721.59 294.21 54.13 0.9514 0.0003 0.9202 0.0007\nHFHplus 8.4404e+07 4372.98 42.26 2.05 0.9856 0.0048 0.9810 0.0107\nNCN 8.4480e+07 14266.80 221.81 49.30 0.9541 0.0037 0.9240 0.0079\nOurs 8.4404e+07 3043.38 27.08 3.26 0.9857 0.0024 0.9809 0.0051 K-Med++ 6.9910e+07 - - - - - - -\nPredictor 7.0361e+07 - - - - - - -\nEFSplus 7.0002e+07 527.34 287.23 23.10 0.9535 0.0003 0.9153 0.0007\nHFHplus 6.9910e+07 4461.84 44.65 2.59 0.9894 0.0027 0.9852 0.0062\nNCN 6.9989e+07 9213.02 209.77 35.93 0.9571 0.0028 0.9224 0.0066\nOurs 6.9909e+07 1497.41 32.51 0.63 0.9901 0.0006 0.9867 0.0010 K-Med++ 6.2759e+07 - - - - - - -\nPredictor 6.3112e+07 - - - - - - -\nEFSplus 6.2797e+07 503.13 285.51 22.48 0.9510 0.0004 0.9025 0.0007\nHFHplus 6.2756e+07 1072.71 44.69 1.34 0.9834 0.0020 0.9758 0.0038\nNCN 6.2792e+07 5662.71 208.89 29.26 0.9528 0.0013 0.9062 0.0030\nOurs 6.2755e+07 677.03 36.87 0.73 0.9842 0.0009 0.9776 0.0016 Table 11: Performance comparison on MNIST dataset with fixed α = 0.2 and varied k. 
Condition Cost Time(s) NMI ARI", + "paper_id": "2603.10721", + "title": "Sample-and-Search: An Effective Algorithm for Learning-Augmented k-Median Clustering in High dimensions", + "authors": [ + "Kangke Cheng", + "Shihong Song", + "Guanlin Mo", + "Hu Ding" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10721v1", + "chunk_index": 35, + "total_chunks": 39, + "char_count": 3558, + "word_count": 590, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "ac89f678-ab54-478f-a73d-6d0c305c761b", + "text": "K-Med++ 44695.78 - - - 1.0000 - 1.0000 -\nPredictor 45101.25 - - - - - - -\nNCN 44700.26 0.66 4.68 0.36 0.9738 0.0031 0.9768 0.0026\nFast-Estimation 44698.33 0.78 2.18 0.15 0.9655 0.0037 0.9670 0.0044\nNgu 44701.73 0.97 4.53 0.44 0.9745 0.0016 0.9774 0.0013\nOurs 44697.49 0.58 0.28 0.03 0.9659 0.0028 0.9677 0.0024 K-Med++ 40348.81 - - - 1.0000 - 1.0000 -\nPredictor 40937.18 - - - - - - -\nNCN 40347.92 1.70 5.10 0.34 0.9643 0.0039 0.9619 0.0043\nFast-Estimation 40344.74 0.81 3.65 0.10 0.9546 0.0177 0.9466 0.0246\nNgu 40345.16 0.76 4.83 0.37 0.9633 0.0009 0.9605 0.0012\nOurs 40344.54 1.02 0.35 0.04 0.9640 0.0021 0.9599 0.0026 K-Med++ 37920.28 - - - 1.0000 - 1.0000 -\nPredictor 38506.77 - - - - - - -\nNCN 37900.65 2.17 5.60 0.46 0.9506 0.0019 0.9376 0.0029\nFast-Estimation 37885.34 7.12 5.02 0.34 0.9439 0.0131 0.9236 0.0223\nNgu 37898.98 1.35 5.00 0.51 0.9523 0.0053 0.9415 0.0075\nOurs 37893.81 3.94 0.39 0.06 0.9481 0.0016 0.9309 0.0041 K-Med++ 36419.10 - - - 1.0000 - 1.0000 -\nPredictor 36981.30 - - - - - - -\nNCN 36398.94 4.92 6.28 0.46 0.9498 0.0049 0.9185 0.0086\nFast-Estimation 36381.18 7.13 6.23 1.08 0.9232 0.0083 0.8667 0.0135\nNgu 36405.05 8.05 4.74 0.47 0.9458 0.0098 0.9120 0.0190\nOurs 36394.45 3.43 0.43 0.05 0.9360 0.0193 0.8887 0.0350 K-Med++ 35078.76 - - - 1.0000 - 1.0000 -\nPredictor 35676.22 - - - - - - -\nNCN 35058.31 5.67 6.68 0.44 0.9561 0.0016 0.9127 0.0051\nFast-Estimation 35047.58 10.19 
7.64 1.05 0.9472 0.0135 0.8874 0.0317\nNgu 35059.27 7.08 5.88 0.89 0.9603 0.0079 0.9188 0.0190\nOurs 35044.96 13.82 0.51 0.04 0.9448 0.0136 0.8852 0.0277 Table 12: Performance comparison on CIFAR-10 dataset with fixed k = 10 and varied α. Condition Cost Time(s) NMI ARI 0 KMe++ 3.9481e+11 - - - - - - - Predictor 3.9680e+11 - - - - - - -\nEFSplus 4.2240e+11 6.5569e+07 7.86 0.22 0.8930 0.0004 0.8853 0.0005\nNCN 3.9869e+11 0 127.44 6.20 0.8980 0.0000 0.8943 0.0000\n0.1 Fast-Estimation 3.9747e+11 5.5532e+08 7.62 0.29 0.8777 0.0093 0.8681 0.0122\nFast-Filtering 3.9680e+11 2.5731e+08 0.73 0.07 0.9201 0.0064 0.9245 0.0073\nFast-Sampling 4.0572e+11 5.2472e+07 61.94 4.30 0.9132 0.0003 0.9149 0.0003\nOurs 3.9522e+11 2.8906e+07 1.86 0.08 0.9315 0.0018 0.9366 0.0024 Predictor 4.0312e+11 - - - - - - -\nEFSplus 4.0320e+11 -e+07 7.70 0.49 0.8539 0.0004 0.8348 0.0005\nNCN 4.0163e+11 0 243.57 10.32 0.8463 0.0000 0.8260 0.0000\n0.2 Fast-Estimation 4.0309e+11 3.2826e+09 7.23 0.23 0.8068 0.0259 0.7661 0.0394\nFast-Filtering 4.0181e+11 4.4826e+08 0.72 0.05 0.8634 0.0156 0.8548 0.0198\nFast-Sampling 4.1658e+11 6.9052e+07 61.60 1.58 0.8220 0.0009 0.7897 0.0012\nOurs 3.9625e+11 1.6161e+08 1.56 0.10 0.8901 0.0051 0.8888 0.0064 Predictor 4.1417e+11 - - - - - - -\nEFSplus 4.1822e+11 7.2262e+07 8.03 0.22 0.7860 0.0004 0.7272 0.0006\nNCN 4.0506e+11 0 349.53 10.41 0.7959 0.0000 0.7530 0.0000\n0.3 Fast-Estimation 4.1693e+11 5.0344e+09 7.38 0.18 0.7432 0.0294 0.6692 0.0466\nFast-Filtering 4.0731e+11 5.4559e+08 0.74 0.04 0.8202 0.0178 0.7967 0.0257\nFast-Sampling 4.3372e+11 1.9210e+08 65.39 2.72 0.8064 0.0007 0.7695 0.0010\nOurs 3.9831e+11 4.9532e+08 1.70 0.14 0.8500 0.0071 0.8350 0.0095 Predictor 4.2775e+11 - - - - - - -\nEFSplus 2.8665e+12 1.1421e+08 7.71 0.37 0.7448 0.0003 0.6573 0.0006\nNCN 4.1171e+11 0 466.82 11.00 0.7494 0.0000 0.6767 0.0000\n0.4 Fast-Estimation 4.2390e+11 3.2667e+09 7.36 0.27 0.6466 0.0356 0.5131 0.0592\nFast-Filtering 4.1778e+11 4.8258e+09 0.73 0.05 0.7700 0.0187 0.7180 
0.0312\nFast-Sampling 4.6080e+11 3.8768e+08 61.84 3.07 0.7606 0.0006 0.6852 0.0008\nOurs 4.0041e+11 5.0524e+08 1.65 0.21 0.8065 0.0071 0.7701 0.0110 Predictor 4.4860e+11 - - - - - - -\nEFSplus 2.8674e+12 8.4072e+07 7.76 0.30 0.7032 0.0003 0.5797 0.0006\nNCN 4.2824e+11 0 583.26 10.26 0.7093 0.0000 0.5972 0.0000\n0.5 Fast-Estimation 4.4518e+11 4.4860e+09 7.54 0.08 0.6389 0.0450 0.5000 0.0636\nFast-Filtering 4.2962e+11 7.4925e+09 0.74 0.05 0.7040 0.0276 0.6160 0.0475\nFast-Sampling 4.4860e+11 0 63.12 1.12 0.7136 0.0006 0.5971 0.0009\nOurs 4.1955e+11 7.6168e+09 1.70 0.14 0.7346 0.0113 0.6551 0.0213 Table 13: Performance comparison on Fashion-MNIST dataset with fixed k = 10 and varied α.", + "paper_id": "2603.10721", + "title": "Sample-and-Search: An Effective Algorithm for Learning-Augmented k-Median Clustering in High dimensions", + "authors": [ + "Kangke Cheng", + "Shihong Song", + "Guanlin Mo", + "Hu Ding" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10721v1", + "chunk_index": 36, + "total_chunks": 39, + "char_count": 4060, + "word_count": 635, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "41c9953a-4365-49ed-9ca4-e46d59f9bbdd", + "text": "Condition Cost Time(s) NMI ARI 0 K-Med++ 1.2450e+11 - - - - - - - Predictor 1.2613e+11 - - - - - - -\nEFSplus 1.2669e+11 7.7786e+06 2.917 0.556 0.8991 0.0002 0.8898 0.0004\nNCN 1.2567e+11 0.00 3.367 0.814 0.9510 0.0000 0.9570 0.0000\n0.1 Fast-Estimation 1.2575e+11 8.3179e+07 2.637 0.791 0.9325 0.0115 0.9339 0.0142\nFast-Filtering 1.2481e+11 3.0078e+07 0.731 0.319 0.9588 0.0046 0.9650 0.0056\nFast-Sampling 1.2541e+11 2.4393e+06 16.926 2.813 0.9514 0.0002 0.9580 0.0002\nOurs 1.2470e+11 5.6813e+06 0.909 0.316 0.9658 0.0010 0.9724 0.0010 Predictor 1.3058e+11 - - - - - - -\nEFSplus 1.3062e+11 1.0415e+07 2.900 0.528 0.8823 0.0002 0.8713 0.0003\nNCN 1.2814e+11 0.00 7.826 2.057 0.9028 0.0000 0.8978 0.0000\n0.2 Fast-Estimation 1.2863e+11 4.7944e+08 2.599 0.933 0.9040 
0.0149 0.8999 0.0220\nFast-Filtering 1.2551e+11 1.4703e+08 0.695 0.319 0.9312 0.0076 0.9363 0.0097\nFast-Sampling 1.2883e+11 6.1020e+06 17.190 3.196 0.9213 0.0003 0.9227 0.0003\nOurs 1.2524e+11 4.3570e+07 0.802 0.247 0.9427 0.0014 0.9497 0.0017 Predictor 1.3752e+11 - - - - - - -\nEFSplus 1.3721e+11 1.7567e+07 3.031 0.608 0.8544 0.0004 0.8328 0.0005\nNCN 1.3255e+11 0.00 7.497 1.342 0.8886 0.0000 0.8817 0.0000\n0.3 Fast-Estimation 1.3293e+11 5.8739e+08 2.744 0.714 0.8656 0.0123 0.8473 0.0181\nFast-Filtering 1.2639e+11 3.5963e+08 0.713 0.384 0.9110 0.0081 0.9139 0.0117\nFast-Sampling 1.3347e+11 6.5379e+07 15.654 2.612 0.8726 0.0012 0.8529 0.0023\nOurs 1.2608e+11 5.0719e+07 0.746 0.349 0.9240 0.0055 0.9300 0.0072 Predictor 1.4681e+11 - - - - - - -\nEFSplus 1.4641e+11 3.1038e+07 3.008 0.505 0.8405 0.0004 0.8136 0.0007\nNCN 1.3952e+11 0.00 11.137 1.466 0.8567 0.0000 0.8329 0.0000\n0.4 Fast-Estimation 1.3915e+11 1.3205e+09 2.469 0.498 0.8398 0.0153 0.8101 0.0234\nFast-Filtering 1.2779e+11 9.4167e+08 0.708 0.376 0.8825 0.0169 0.8772 0.0241\nFast-Sampling 1.3831e+11 1.0812e+08 15.036 3.219 0.8629 0.0009 0.8422 0.0014\nOurs 1.2741e+11 2.8672e+08 0.706 0.264 0.9000 0.0075 0.9019 0.0109 Predictor 1.5849e+11 - - - - - - -\nEFSplus 1.5808e+11 2.9992e+07 2.845 0.640 0.8122 0.0003 0.7728 0.0005\nNCN 1.4889e+11 0.00 11.449 1.345 0.8329 0.0000 0.7991 0.0000\n0.5 Fast-Estimation 1.4781e+11 1.4452e+09 2.508 0.530 0.8090 0.0270 0.7711 0.0386\nFast-Filtering 1.3227e+11 1.7529e+09 0.663 0.283 0.8445 0.0185 0.8349 0.0252\nFast-Sampling 1.4682e+11 1.2533e+08 15.540 2.515 0.8441 0.0012 0.8168 0.0021\nOurs 1.3183e+11 6.7374e+08 0.721 0.188 0.8555 0.0043 0.8510 0.0060 Table 14: Performance comparison on MNIST dataset with fixed k = 10 and varied α. 
Condition Cost Time(s) NMI ARI 0 KMe++ 1.1653e+06 - - - - - - - Predictor 1.1790e+06 - - - - - - -\nEFSplus 1.1867e+06 1057.86 0.080 0.050 0.9687 0.0034 0.9714 0.0033\nNCN 1.1747e+06 0.00 0.057 0.035 0.9778 0.0000 0.9801 0.0000\n0.1 Fast-Estimation 1.1757e+06 315.60 0.087 0.037 0.9770 0.0017 0.9793 0.0021\nFast-Filtering 1.1683e+06 251.37 0.013 0.013 0.9775 0.0043 0.9803 0.0043\nFast-Sampling 1.1737e+06 216.39 2.199 0.545 0.9782 0.0027 0.9805 0.0024\nOurs 1.1677e+06 0.00 0.016 0.014 0.9825 0.0000 0.9856 0.0000 Predictor 1.2104e+06 - - - - - - -\nEFSplus 1.2036e+06 1605.93 0.070 0.032 0.9492 0.0035 0.9523 0.0041\nNCN 1.1937e+06 0.00 0.124 0.042 0.9545 0.0000 0.9579 0.0000\n0.2 Fast-Estimation 1.1946e+06 1497.59 0.100 0.032 0.9574 0.0049 0.9609 0.0057\nFast-Filtering 1.1719e+06 1187.77 0.013 0.013 0.9677 0.0069 0.9702 0.0066\nFast-Sampling 1.1974e+06 637.34 2.269 0.463 0.9583 0.0023 0.9610 0.0024\nOurs 1.1703e+06 0.00 0.023 0.018 0.9687 0.0000 0.9713 0.0000 Predictor 1.2643e+06 - - - - - - -\nEFSplus 1.2493e+06 3918.83 0.087 0.040 0.9294 0.0051 0.9295 0.0048\nNCN 1.2223e+06 0.00 0.129 0.037 0.9383 0.0000 0.9391 0.0000\n0.3 Fast-Estimation 1.2309e+06 4058.38 0.106 0.024 0.9336 0.0077 0.9345 0.0084\nFast-Filtering 1.1786e+06 1892.68 0.020 0.012 0.9411 0.0066 0.9438 0.0074\nFast-Sampling 1.2438e+06 547.48 2.234 0.428 0.9383 0.0018 0.9390 0.0023\nOurs 1.1770e+06 0.00 0.029 0.011 0.9432 0.0000 0.9455 0.0000 Predictor 1.3324e+06 - - - - - - -\nEFSplus 1.2890e+06 4008.71 0.103 0.042 0.9208 0.0057 0.9186 0.0079\nNCN 1.2663e+06 0.00 0.141 0.030 0.9209 0.0000 0.9198 0.0000\n0.4 Fast-Estimation 1.2804e+06 5819.88 0.096 0.037 0.9231 0.0085 0.9227 0.0098\nFast-Filtering 1.1837e+06 1949.92 0.011 0.011 0.9340 0.0097 0.9333 0.0121\nFast-Sampling 1.2920e+06 7189.59 1.681 0.544 0.9160 0.0054 0.9151 0.0056\nOurs 1.1815e+06 0.00 0.012 0.014 0.9490 0.0000 0.9503 0.0000 Predictor 1.4340e+06 - - - - - - -\nEFSplus 1.3886e+06 5775.49 0.122 0.026 0.8939 0.0057 0.8887 0.0077\nNCN 1.3516e+06 
0.00 0.153 0.030 0.9056 0.0000 0.9035 0.0000\n0.5 Fast-Estimation 1.3712e+06 6201.06 0.094 0.021 0.8984 0.0142 0.8924 0.0188\nFast-Filtering 1.2064e+06 12137.27 0.016 0.017 0.8928 0.0156 0.8839 0.0202\nFast-Sampling 1.3790e+06 7845.73 1.658 0.614 0.8867 0.0082 0.8791 0.0094\nOurs 1.2051e+06 0.00 0.022 0.019 0.9150 0.0000 0.9078 0.0000", + "paper_id": "2603.10721", + "title": "Sample-and-Search: An Effective Algorithm for Learning-Augmented k-Median Clustering in High dimensions", + "authors": [ + "Kangke Cheng", + "Shihong Song", + "Guanlin Mo", + "Hu Ding" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10721v1", + "chunk_index": 37, + "total_chunks": 39, + "char_count": 4794, + "word_count": 685, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "012a71f1-c856-47d1-9cea-2d115dea4ab5", + "text": "Table 15: Performance comparison on PHY dataset with fixed k = 10 and varied α. Condition Cost Time(s) NMI ARI 0 K-Med++ 1.0147e+12 - - - - - - - Predictor 1.1078e+12 - - - - - - -\nEFSplus 1.0230e+12 6.5840e+08 0.357 0.138 0.9769 0.0052 0.9758 0.0066\nNCN 1.0167e+12 0.00 1.083 0.274 0.9937 0.0000 0.9953 0.0000\n0.1 Fast-Estimation 1.0239e+12 3.2252e+09 0.284 0.090 0.9776 0.0063 0.9778 0.0079\nFast-Filtering 1.0171e+12 5.2853e+08 0.091 0.050 0.9913 0.0050 0.9924 0.0060\nFast-Sampling 1.0258e+12 1.4804e+09 1.931 0.497 0.9749 0.0055 0.9727 0.0069\nOurs 1.0161e+12 5.6604e+08 0.206 0.072 0.9859 0.0041 0.9857 0.0051 Predictor 1.3930e+12 - - - - - - -\nEFSplus 1.0204e+12 2.4351e+08 0.340 0.083 0.9896 0.0025 0.9913 0.0025\nNCN 1.0219e+12 0.00 1.560 0.385 0.9859 0.0000 0.9884 0.0000\n0.2 Fast-Estimation 1.0532e+12 1.6020e+10 0.321 0.119 0.9527 0.0144 0.9420 0.0228\nFast-Filtering 1.0176e+12 3.4455e+08 0.097 0.039 0.9913 0.0019 0.9923 0.0023\nFast-Sampling 1.0219e+12 7.6533e+08 1.847 0.523 0.9825 0.0034 0.9832 0.0042\nOurs 1.0173e+12 2.2068e+08 0.189 0.063 0.9924 0.0016 0.9930 0.0020 Predictor 1.8631e+12 - - - - - - 
-\nEFSplus 1.0528e+12 8.8879e+09 0.438 0.143 0.9646 0.0059 0.9594 0.0085\nNCN 1.0300e+12 0.00 2.333 0.418 0.9874 0.0000 0.9859 0.0000\n0.3 Fast-Estimation 1.1067e+12 2.2806e+10 0.275 0.080 0.9309 0.0148 0.9074 0.0290\nFast-Filtering 1.0227e+12 2.4148e+09 0.095 0.035 0.9812 0.0066 0.9794 0.0093\nFast-Sampling 1.0418e+12 6.5694e+09 1.734 0.574 0.9693 0.0068 0.9656 0.0107\nOurs 1.0193e+12 1.2542e+09 0.184 0.055 0.9832 0.0043 0.9816 0.0061 Predictor 2.6240e+12 - - - - - - -\nEFSplus 1.0446e+12 2.3424e+09 0.438 0.083 0.9815 0.0037 0.9793 0.0049\nNCN 1.0367e+12 0.00 2.554 0.609 0.9787 0.0000 0.9770 0.0000\n0.4 Fast-Estimation 1.1720e+12 6.0011e+10 0.338 0.105 0.9207 0.0199 0.8932 0.0328\nFast-Filtering 1.0313e+12 6.8650e+09 0.090 0.031 0.9780 0.0098 0.9753 0.0151\nFast-Sampling 1.0510e+12 5.7277e+09 1.740 0.408 0.9609 0.0052 0.9568 0.0081\nOurs 1.0257e+12 2.5217e+09 0.215 0.036 0.9759 0.0106 0.9724 0.0140 Predictor 3.5556e+12 - - - - - - -\nEFSplus 1.0512e+12 4.1080e+09 0.407 0.070 0.9733 0.0053 0.9726 0.0068\nNCN 1.0449e+12 0.00 2.910 0.861 0.9794 0.0000 0.9797 0.0000\n0.5 Fast-Estimation 1.2459e+12 1.0526e+11 0.277 0.109 0.9056 0.0410 0.8617 0.0765\nFast-Filtering 1.0389e+12 4.5319e+09 0.081 0.024 0.9714 0.0089 0.9679 0.0125\nFast-Sampling 1.0835e+12 2.6368e+10 1.711 0.410 0.9453 0.0147 0.9278 0.0254\nOurs 1.0278e+12 1.8072e+09 0.215 0.045 0.9758 0.0067 0.9728 0.0094 Table 16: Performance comparison on CIFAR-10 dataset with fixed α = 0.2 and varied k. 
Condition Cost Time(s) NMI ARI K-Med++ 7.8584e+10 - - - - - - -\nPredictor 8.0164e+10 - - - - - - -\nEFSplus 7.9181e+10 2.64e+06 1.347 0.025 0.8625 0.0008 0.8403 0.0008\nNCN 7.9408e+10 0.00 3.678 0.030 0.8282 0.0000 0.7948 0.0000\nFast-Estimation 7.9677e+10 2.54e+08 1.345 0.022 0.7905 0.0183 0.7350 0.0259\nFast-Filtering 7.8971e+10 3.99e+07 0.163 0.005 0.8786 0.0039 0.8673 0.0052\nFast-Sampling 7.9523e+10 3.50e+05 28.082 0.664 0.8311 0.0003 0.7883 0.0007\nOurs 7.8896e+10 3.46e+07 0.101 0.001 0.8889 0.0073 0.8816 0.0106 K-Med++ 7.3059e+10 - - - - - - -\nPredictor 7.4801e+10 - - - - - - -\nEFSplus 7.3875e+10 2.71e+06 1.784 0.002 0.8740 0.0003 0.8398 0.0004\nNCN 7.3838e+10 0.00 3.846 0.007 0.8384 0.0000 0.7678 0.0000\nFast-Estimation 7.4010e+10 3.28e+08 1.650 0.007 0.7990 0.0218 0.7156 0.0321\nFast-Filtering 7.3470e+10 4.43e+07 0.136 0.002 0.8836 0.0084 0.8578 0.0121\nFast-Sampling 7.4160e+10 1.73e+06 52.049 0.157 0.8401 0.0007 0.7770 0.0010\nOurs 7.3405e+10 5.11e+07 0.087 0.001 0.8990 0.0093 0.8803 0.0121 K-Med++ 7.0352e+10 - - - - - - -\nPredictor 7.2285e+10 - - - - - - -\nEFSplus 7.1369e+10 3.01e+06 2.263 0.005 0.8714 0.0015 0.8257 0.0023\nNCN 7.1273e+10 0.00 4.083 0.011 0.8401 0.0000 0.7589 0.0000\nFast-Estimation 7.1387e+10 1.31e+08 1.997 0.010 0.7977 0.0101 0.7013 0.0158\nFast-Filtering 7.0864e+10 6.16e+07 0.130 0.001 0.8793 0.0057 0.8381 0.0083\nFast-Sampling 7.1625e+10 7.08e+05 72.828 0.130 0.8316 0.0008 0.7495 0.0011\nOurs 7.0763e+10 2.71e+07 0.086 0.000 0.8963 0.0059 0.8636 0.0086 K-Med++ 6.8488e+10 - - - - - - -\nPredictor 7.0455e+10 - - - - - - -\nEFSplus 6.9651e+10 4.65e+06 2.741 0.007 0.8784 0.0008 0.8265 0.0011\nNCN 6.9464e+10 0.00 4.332 0.005 0.8581 0.0000 0.7825 0.0000\nFast-Estimation 6.9656e+10 8.30e+07 2.345 0.003 0.8035 0.0097 0.6949 0.0230\nFast-Filtering 6.9034e+10 4.77e+07 0.124 0.002 0.8881 0.0029 0.8463 0.0037\nFast-Sampling 6.9830e+10 1.44e+06 92.965 0.099 0.8377 0.0004 0.7471 0.0006\nOurs 6.8911e+10 3.58e+07 0.087 0.000 0.9042 0.0023 0.8716 
0.0028 K-Med++ 6.7185e+10 - - - - - - -\nPredictor 6.9137e+10 - - - - - - -\nEFSplus 6.8442e+10 2.30e+06 3.214 0.009 0.8793 0.0007 0.8159 0.0009\nNCN 6.8172e+10 0.00 4.586 0.026 0.8549 0.0000 0.7636 0.0000\nFast-Estimation 6.8313e+10 1.05e+08 2.712 0.018 0.8011 0.0164 0.6804 0.0276\nFast-Filtering 6.7753e+10 1.15e+08 0.121 0.002 0.8843 0.0161 0.8327 0.0256\nFast-Sampling 6.8527e+10 1.20e+06 113.940 0.304 0.8326 0.0004 0.7240 0.0007\nOurs 6.7642e+10 2.57e+07 0.087 0.000 0.9023 0.0030 0.8602 0.0041", + "paper_id": "2603.10721", + "title": "Sample-and-Search: An Effective Algorithm for Learning-Augmented k-Median Clustering in High dimensions", + "authors": [ + "Kangke Cheng", + "Shihong Song", + "Guanlin Mo", + "Hu Ding" + ], + "published_date": "2026-03-11", + "primary_category": "", + "arxiv_url": "http://arxiv.org/abs/2603.10721v1", + "chunk_index": 38, + "total_chunks": 39, + "char_count": 5033, + "word_count": 730, + "chunking_strategy": "semantic" + }, + { + "chunk_id": "963332cb-4451-4ff7-8bf1-b254b9ac74f0", + "text": "Condition Cost Time(s) NMI ARI K-Med++ 1.2454e+11 - - - - - - -\nPredictor 1.3036e+11 - - - - - - -\nEFSplus 1.3029e+11 1.07e+07 1.227 0.004 0.8876 0.0005 0.8857 0.0006\nNCN 1.2941e+11 0.00 1.446 0.002 0.9215 0.0000 0.9266 0.0000\nFast-Estimation 1.2933e+11 6.37e+08 1.203 0.003 0.9091 0.0214 0.9105 0.0267\nFast-Filtering 1.2576e+11 1.35e+08 0.272 0.006 0.9279 0.0148 0.9338 0.0211\nFast-Sampling 1.2864e+11 4.18e+06 6.673 0.013 0.9258 0.0001 0.9338 0.0001\nOurs 1.2541e+11 5.86e+07 0.205 0.000 0.9353 0.0039 0.9430 0.0049 K-Med++ 1.0408e+11 - - - - - - -\nPredictor 1.1107e+11 - - - - - - -\nEFSplus 1.1060e+11 8.42e+06 1.364 0.003 0.8941 0.0004 0.8549 0.0005\nNCN 1.0993e+11 0.00 1.490 0.004 0.9208 0.0000 0.9002 0.0000\nFast-Estimation 1.0858e+11 3.34e+08 1.275 0.004 0.8897 0.0009 0.8484 0.0051\nFast-Filtering 1.0462e+11 9.59e+07 0.248 0.001 0.9464 0.0036 0.9400 0.0045\nFast-Sampling 1.0900e+11 2.67e+06 12.059 0.032 0.9281 0.0002 0.9122 0.0003\nOurs 
1.0446e+11 1.13e+07 0.201 0.000 0.9572 0.0011 0.9552 0.0017\n\nK-Med++ 9.6166e+10 - - - - - - -\nPredictor 1.0296e+11 - - - - - - -\nEFSplus 1.0253e+11 2.13e+07 1.506 0.006 0.8814 0.0003 0.8134 0.0007\nNCN 1.0172e+11 0.00 5.078 0.010 0.8925 0.0000 0.8337 0.0000\nFast-Estimation 1.0050e+11 3.14e+08 1.363 0.003 0.8874 0.0091 0.8322 0.0196\nFast-Filtering 9.6482e+10 3.18e+07 0.224 0.001 0.9546 0.0030 0.9435 0.0060\nFast-Sampling 1.0091e+11 2.98e+06 16.786 0.136 0.9120 0.0001 0.8736 0.0002\nOurs 9.6401e+10 1.04e+07 0.199 0.001 0.9645 0.0013 0.9576 0.0022\n\nK-Med++ 9.1325e+10 - - - - - - -\nPredictor 9.8290e+10 - - - - - - -\nEFSplus 9.7755e+10 9.39e+06 1.593 0.007 0.8784 0.0003 0.7985 0.0007\nNCN 9.6438e+10 0.00 5.042 0.005 0.8950 0.0000 0.8224 0.0000\nFast-Estimation 9.4824e+10 1.85e+08 1.408 0.002 0.8880 0.0037 0.8184 0.0116\nFast-Filtering 9.1581e+10 4.81e+07 0.224 0.001 0.9534 0.0039 0.9397 0.0054\nFast-Sampling 9.6159e+10 1.67e+06 21.250 0.065 0.9117 0.0002 0.8683 0.0002\nOurs 9.1515e+10 4.14e+06 0.201 0.001 0.9621 0.0019 0.9515 0.0031\n\nK-Med++ 8.7954e+10 - - - - - - -\nPredictor 9.5100e+10 - - - - - - -\nEFSplus 9.4367e+10 2.08e+07 1.714 0.006 0.8737 0.0005 0.7765 0.0011\nNCN 9.2909e+10 0.00 5.102 0.016 0.8878 0.0000 0.7986 0.0000\nFast-Estimation 9.1920e+10 1.39e+08 1.492 0.001 0.8759 0.0026 0.7837 0.0067\nFast-Filtering 8.8252e+10 4.96e+07 0.211 0.001 0.9506 0.0010 0.9300 0.0014\nFast-Sampling 9.2943e+10 3.48e+06 25.838 0.054 0.9072 0.0001 0.8537 0.0002\nOurs 8.8179e+10 6.09e+06 0.204 0.002 0.9609 0.0011 0.9486 0.0019\n\nTable 18: Performance comparison on PHY dataset with fixed α = 0.2 and varied k. 
\nCondition Cost ±std Time(s) ±std NMI ±std ARI ±std\nK-Med++ 1.0147e+12 - - - - - - -\nPredictor 1.3965e+12 - - - - - - -\nEFSplus 1.0562e+12 4.32e+09 0.109 0.000 0.9591 0.0045 0.9523 0.0068\nNCN 1.0209e+12 0.00 0.411 0.001 0.9829 0.0000 0.9834 0.0000\nFast-Estimation 1.0657e+12 1.48e+10 0.096 0.000 0.9530 0.0149 0.9439 0.0218\nFast-Filtering 1.0246e+12 1.43e+08 0.020 0.000 0.9882 0.0007 0.9902 0.0008\nFast-Sampling 1.0620e+12 2.00e+10 0.471 0.004 0.9590 0.0065 0.9516 0.0120\nOurs 1.0206e+12 1.47e+09 0.043 0.000 0.9737 0.0074 0.9705 0.0104\n\nK-Med++ 2.9135e+11 - - - - - - -\nPredictor 6.1176e+11 - - - - - - -\nEFSplus 3.0869e+11 6.94e+08 0.126 0.000 0.9654 0.0024 0.9488 0.0039\nNCN 2.9964e+11 0.00 0.414 0.001 0.9858 0.0000 0.9827 0.0000\nFast-Estimation 3.1909e+11 1.05e+10 0.102 0.001 0.9311 0.0123 0.8793 0.0287\nFast-Filtering 2.9358e+11 4.09e+08 0.022 0.000 0.9720 0.0015 0.9592 0.0026\nFast-Sampling 3.0647e+11 1.16e+09 0.821 0.006 0.9617 0.0066 0.9415 0.0116\nOurs 2.9357e+11 4.20e+08 0.046 0.000 0.9768 0.0041 0.9675 0.0074\n\nK-Med++ 1.6400e+11 - - - - - - -\nPredictor 4.3770e+11 - - - - - - -\nEFSplus 1.7918e+11 3.16e+08 0.141 0.000 0.9695 0.0052 0.9509 0.0095\nNCN 1.7292e+11 0.00 0.416 0.002 0.9797 0.0000 0.9695 0.0000\nFast-Estimation 1.9011e+11 7.78e+09 0.111 0.001 0.9341 0.0056 0.8711 0.0146\nFast-Filtering 1.6490e+11 1.16e+08 0.023 0.000 0.9775 0.0017 0.9655 0.0035\nFast-Sampling 1.7574e+11 4.23e+08 1.189 0.003 0.9709 0.0037 0.9552 0.0076\nOurs 1.6468e+11 9.90e+07 0.051 0.000 0.9778 0.0021 0.9666 0.0034\n\nK-Med++ 1.1922e+11 - - - - - - -\nPredictor 3.6763e+11 - - - - - - -\nEFSplus 1.3394e+11 1.26e+08 0.161 0.000 0.9725 0.0015 0.9527 0.0031\nNCN 1.2603e+11 0.00 0.421 0.001 0.9736 0.0000 0.9549 0.0000\nFast-Estimation 1.4017e+11 3.93e+09 0.119 0.000 0.9291 0.0051 0.8526 0.0115\nFast-Filtering 1.2006e+11 5.77e+07 0.024 0.000 0.9838 0.0024 0.9748 0.0049\nFast-Sampling 1.3034e+11 4.65e+08 1.573 0.007 0.9705 0.0053 0.9482 0.0110\nOurs 1.1993e+11 3.56e+07 0.057 0.000 0.9862 0.0033 0.9787 
0.0061\n\nK-Med++ 9.9887e+10 - - - - - - -\nPredictor 3.1986e+11 - - - - - - -\nEFSplus 1.1631e+11 1.98e+08 0.175 0.000 0.9736 0.0020 0.9562 0.0045\nNCN 1.0962e+11 0.00 0.428 0.002 0.9668 0.0000 0.9365 0.0000\nFast-Estimation 1.2197e+11 5.73e+09 0.126 0.001 0.9228 0.0091 0.8259 0.0259\nFast-Filtering 1.0261e+11 2.44e+07 0.025 0.000 0.9803 0.0009 0.9704 0.0018\nFast-Sampling 1.1200e+11 1.96e+08 1.896 0.011 0.9689 0.0033 0.9463 0.0074\nOurs 1.0254e+11 1.70e+07 0.062 0.000 0.9826 0.0016 0.9735 0.0029",