new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

Apr 17

How to Detect Network Dependence in Latent Factor Models? A Bias-Corrected CD Test

In a recent paper Juodis and Reese (2022) (JR) show that the application of the CD test proposed by Pesaran (2004) to residuals from panels with latent factors results in over-rejection. They propose a randomized test statistic to correct for over-rejection, and add a screening component to achieve power. This paper considers the same problem but from a different perspective, and shows that the standard CD test remains valid if the latent factors are weak in the sense the strength is less than half. In the case where latent factors are strong, we propose a bias-corrected version, CD*, which is shown to be asymptotically standard normal under the null of error cross-sectional independence and have power against network type alternatives. This result is shown to hold for pure latent factor models as well as for panel regression models with latent factors. The case where the errors are serially correlated is also considered. Small sample properties of the CD* test are investigated by Monte Carlo experiments and are shown to have the correct size for strong and weak factors as well as for Gaussian and non-Gaussian errors. In contrast, it is found that JR's test tends to over-reject in the case of panels with non-Gaussian errors, and has low power against spatial network alternatives. In an empirical application, using the CD* test, it is shown that there remains spatial error dependence in a panel data model for real house price changes across 377 Metropolitan Statistical Areas in the U.S., even after the effects of latent factors are filtered out.

  • 2 authors
·
Sep 1, 2021

Exploring Trade Openness and Logistics Efficiency in the G20 Economies: A Bootstrap ARDL Analysis of Growth Dynamics

This study examines the relationship between trade openness, logistics performance, and economic growth within G20 economies. Using a Bootstrap Autoregressive Distributed Lag (ARDL) model augmented by a dynamic error correction mechanism (ECM), the analysis quantifies both short run and long run effects of trade facilitation and logistics infrastructure, measured via the World Bank's Logistics Performance Index (LPI) from 2007 to 2023, on economic growth. The G20, as a consortium of the world's leading economies, exhibits significant variation in logistics efficiency and degrees of trade openness, providing a robust context for comparative analysis. The ARDL-ECM approach, reinforced by bootstrap resampling, delivers reliable estimates even in the presence of small samples and complex variable linkages. Findings are intended to inform policymakers seeking to enhance trade competitiveness and economic development through targeted investment in infrastructure and regulatory reforms supporting trade facilitation. The results underscore the critical role of efficient logistics specifically customs administration, physical infrastructure, and shipment reliability in driving international trade and fostering sustained economic growth. Improvements in these areas can substantially increase a country's trade capacity and overall economic performance.

  • 2 authors
·
Aug 30, 2025

Extending Mixture of Experts Model to Investigate Heterogeneity of Trajectories: When, Where and How to Add Which Covariates

Researchers are usually interested in examining the impact of covariates when separating heterogeneous samples into latent classes that are more homogeneous. The majority of theoretical and empirical studies with such aims have focused on identifying covariates as predictors of class membership in the structural equation modeling framework. In other words, the covariates only indirectly affect the sample heterogeneity. However, the covariates' influence on between-individual differences can also be direct. This article presents a mixture model that investigates covariates to explain within-cluster and between-cluster heterogeneity simultaneously, known as a mixture-of-experts (MoE) model. This study aims to extend the MoE framework to investigate heterogeneity in nonlinear trajectories: to identify latent classes, covariates as predictors to clusters, and covariates that explain within-cluster differences in change patterns over time. Our simulation studies demonstrate that the proposed model generally estimates the parameters unbiasedly, precisely and exhibits appropriate empirical coverage for a nominal 95% confidence interval. This study also proposes implementing structural equation model forests to shrink the covariate space of the proposed mixture model. We illustrate how to select covariates and construct the proposed model with longitudinal mathematics achievement data. Additionally, we demonstrate that the proposed mixture model can be further extended in the structural equation modeling framework by allowing the covariates that have direct effects to be time-varying.

  • 2 authors
·
Jul 5, 2020

Beyond the Mean: Limit Theory and Tests for Infinite-Mean Autoregressive Conditional Durations

Integrated autoregressive conditional duration (ACD) models serve as natural counterparts to the well-known integrated GARCH models used for financial returns. However, despite their resemblance, asymptotic theory for ACD is challenging and also not complete, in particular for integrated ACD. Central challenges arise from the facts that (i) integrated ACD processes imply durations with infinite expectation, and (ii) even in the non-integrated case, conventional asymptotic approaches break down due to the randomness in the number of durations within a fixed observation period. Addressing these challenges, we provide here unified asymptotic theory for the (quasi-) maximum likelihood estimator for ACD models; a unified theory which includes integrated ACD models. Based on the new results, we also provide a novel framework for hypothesis testing in duration models, enabling inference on a key empirical question: whether durations possess a finite or infinite expectation. We apply our results to high-frequency cryptocurrency ETF trading data. Motivated by parameter estimates near the integrated ACD boundary, we assess whether durations between trades in these markets have finite expectation, an assumption often made implicitly in the literature on point process models. Our empirical findings indicate infinite-mean durations for all the five cryptocurrencies examined, with the integrated ACD hypothesis rejected -- against alternatives with tail index less than one -- for four out of the five cryptocurrencies considered.

  • 4 authors
·
May 9, 2025

Dynamic Factor Analysis of Price Movements in the Philippine Stock Exchange

The intricate dynamics of stock markets have led to extensive research on models that are able to effectively explain their inherent complexities. This study leverages the econometrics literature to explore the dynamic factor model as an interpretable model with sufficient predictive capabilities for capturing essential market phenomena. Although the model has been extensively applied for predictive purposes, this study focuses on analyzing the extracted loadings and common factors as an alternative framework for understanding stock price dynamics. The results reveal novel insights into traditional market theories when applied to the Philippine Stock Exchange using the Kalman method and maximum likelihood estimation, with subsequent validation against the capital asset pricing model. Notably, a one-factor model extracts a common factor representing systematic or market dynamics similar to the composite index, whereas a two-factor model extracts common factors representing market trends and volatility. Furthermore, an application of the model for nowcasting the growth rates of the Philippine gross domestic product highlights the potential of the extracted common factors as viable real-time market indicators, yielding over a 34% decrease in the out-of-sample prediction error. Overall, the results underscore the value of dynamic factor analysis in gaining a deeper understanding of market price movement dynamics.

  • 6 authors
·
Oct 8, 2025

An Investigation of the Structural Characteristics of the Indian IT Sector and the Capital Goods Sector: An Application of the R Programming in Time Series Decomposition and Forecasting

Time series analysis and forecasting of stock market prices has been a very active area of research over the last two decades. Availability of extremely fast and parallel architecture of computing and sophisticated algorithms has made it possible to extract, store, process and analyze high volume stock market time series data very efficiently. In this paper, we have used time series data of the two sectors of the Indian economy: Information Technology and Capital Goods for the period January 2009 till April 2016 and have studied the relationships of these two time series with the time series of DJIA index, NIFTY index and the US Dollar to Indian Rupee exchange rate. We establish by graphical and statistical tests that while the IT sector of India has a strong association with DJIA index and the Dollar to Rupee exchange rate, the Indian CG sector exhibits a strong association with the NIFTY index. We contend that these observations corroborate our hypotheses that the Indian IT sector is strongly coupled with the world economy whereas the CG sector of India reflects internal economic growth of India. We also present several models of regression between the time series which exhibit strong association among them. The effectiveness of these models have been demonstrated by very low values of their forecasting errors.

  • 2 authors
·
May 14, 2017

Double Machine Learning meets Panel Data -- Promises, Pitfalls, and Potential Solutions

Estimating causal effect using machine learning (ML) algorithms can help to relax functional form assumptions if used within appropriate frameworks. However, most of these frameworks assume settings with cross-sectional data, whereas researchers often have access to panel data, which in traditional methods helps to deal with unobserved heterogeneity between units. In this paper, we explore how we can adapt double/debiased machine learning (DML) (Chernozhukov et al., 2018) for panel data in the presence of unobserved heterogeneity. This adaptation is challenging because DML's cross-fitting procedure assumes independent data and the unobserved heterogeneity is not necessarily additively separable in settings with nonlinear observed confounding. We assess the performance of several intuitively appealing estimators in a variety of simulations. While we find violations of the cross-fitting assumptions to be largely inconsequential for the accuracy of the effect estimates, many of the considered methods fail to adequately account for the presence of unobserved heterogeneity. However, we find that using predictive models based on the correlated random effects approach (Mundlak, 1978) within DML leads to accurate coefficient estimates across settings, given a sample size that is large relative to the number of observed confounders. We also show that the influence of the unobserved heterogeneity on the observed confounders plays a significant role for the performance of most alternative methods.

  • 2 authors
·
Sep 2, 2024

CAMEF: Causal-Augmented Multi-Modality Event-Driven Financial Forecasting by Integrating Time Series Patterns and Salient Macroeconomic Announcements

Accurately forecasting the impact of macroeconomic events is critical for investors and policymakers. Salient events like monetary policy decisions and employment reports often trigger market movements by shaping expectations of economic growth and risk, thereby establishing causal relationships between events and market behavior. Existing forecasting methods typically focus either on textual analysis or time-series modeling, but fail to capture the multi-modal nature of financial markets and the causal relationship between events and price movements. To address these gaps, we propose CAMEF (Causal-Augmented Multi-Modality Event-Driven Financial Forecasting), a multi-modality framework that effectively integrates textual and time-series data with a causal learning mechanism and an LLM-based counterfactual event augmentation technique for causal-enhanced financial forecasting. Our contributions include: (1) a multi-modal framework that captures causal relationships between policy texts and historical price data; (2) a new financial dataset with six types of macroeconomic releases from 2008 to April 2024, and high-frequency real trading data for five key U.S. financial assets; and (3) an LLM-based counterfactual event augmentation strategy. We compare CAMEF to state-of-the-art transformer-based time-series and multi-modal baselines, and perform ablation studies to validate the effectiveness of the causal learning mechanism and event types.

  • 5 authors
·
Aug 7, 2025

Two-stage Estimation of Latent Variable Regression Models: A General, Root-N Consistent Solution

Latent variable (LV) models are widely used in psychological research to investigate relationships among unobservable constructs. When one-stage estimation of the overall LV model is challenging, two-stage factor score regression (FSR) serves as a convenient alternative: the measurement model is fitted to obtain factor scores in the first stage, which are then used to fit the structural model in the subsequent stage. However, naive application of FSR is known to yield biased estimates of structural parameters. In this paper, we develop a generic bias-correction framework for two-stage estimation of parametric statistical models and tailor it specifically to FSR. Unlike existing bias-corrected FSR solutions, the proposed method applies to a broader class of LV models and does not require computing specific types of factor scores. We establish the root-n consistency of the proposed bias-corrected two-stage estimator under mild regularity conditions. To ensure broad applicability and minimize reliance on complex analytical derivations, we introduce a stochastic approximation algorithm for point estimation and a Monte Carlo-based procedure for variance estimation. In a sequence of Monte Carlo experiments, we demonstrate that the bias-corrected FSR estimator performs comparably to the ``gold standard'' one-stage maximum likelihood estimator. These results suggest that our approach offers a straightforward yet effective alternative for estimating LV models.

  • 7 authors
·
Jan 24

Information-Theoretic Causal Bounds under Unmeasured Confounding

We develop a data-driven information-theoretic framework for sharp partial identification of causal effects under unmeasured confounding. Existing approaches often rely on restrictive assumptions, such as bounded or discrete outcomes; require external inputs (for example, instrumental variables, proxies, or user-specified sensitivity parameters); necessitate full structural causal model specifications; or focus solely on population-level averages while neglecting covariate-conditional effects. We overcome all four limitations simultaneously by establishing novel information-theoretic, data-driven divergence bounds. Our key theoretical contribution shows that the f-divergence between the observational distribution P(Y | A = a, X = x) and the interventional distribution P(Y | do(A = a), X = x) is upper bounded by a function of the propensity score alone. This result enables sharp partial identification of conditional causal effects directly from observational data, without requiring external sensitivity parameters, auxiliary variables, full structural specifications, or outcome boundedness assumptions. For practical implementation, we develop a semiparametric estimator satisfying Neyman orthogonality (Chernozhukov et al., 2018), which ensures root-n consistent inference even when nuisance functions are estimated via flexible machine learning methods. Simulation studies and real-world data applications, implemented in the GitHub repository (https://github.com/yonghanjung/Information-Theretic-Bounds), demonstrate that our framework provides tight and valid causal bounds across a wide range of data-generating processes.

  • 2 authors
·
Jan 23

Chinese vs. World Bank Development Projects: Insights from Earth Observation and Computer Vision on Wealth Gains in Africa, 2002-2013

Debates about whether development projects improve living conditions persist, partly because observational estimates can be biased by incomplete adjustment and because reliable outcome data are scarce at the neighborhood level. We address both issues in a continent-scale, sector-specific evaluation of Chinese and World Bank projects across 9,899 neighborhoods in 36 African countries (2002 to 2013), representative of 88% of the population. First, we use a recent dataset that measures living conditions with a machine-learned wealth index derived from contemporaneous satellite imagery, yielding a consistent panel of 6.7 km square mosaics. Second, to strengthen identification, we proxy officials' map-based placement criteria using pre-treatment daytime satellite images and fuse these with rich tabular covariates to estimate funder- and sector-specific ATEs via inverse-probability weighting. Incorporating imagery systematically shrinks effects relative to tabular-only models, indicating prior work likely overstated benefits. On average, both donors raise wealth, with larger gains for China; sector extremes in our sample include Trade and Tourism for the World Bank (+6.27 IWI points), and Emergency Response for China (+14.32). Assignment-mechanism analyses show World Bank placement is generally more predictable from imagery alone, as well as from tabular covariates. This suggests that Chinese project placements are more driven by non-visible, political, or event-driven factors than World Bank placements. To probe residual concerns about selection on observables, we also estimate within-neighborhood (unit) fixed-effects models at a spatial resolution about 450 times finer than prior fixed effects analyses, leveraging the computer-vision-imputed IWI panels; these deliver smaller but directionally consistent effects.

Enhancing Price Prediction in Cryptocurrency Using Transformer Neural Network and Technical Indicators

This study presents an innovative approach for predicting cryptocurrency time series, specifically focusing on Bitcoin, Ethereum, and Litecoin. The methodology integrates the use of technical indicators, a Performer neural network, and BiLSTM (Bidirectional Long Short-Term Memory) to capture temporal dynamics and extract significant features from raw cryptocurrency data. The application of technical indicators, such facilitates the extraction of intricate patterns, momentum, volatility, and trends. The Performer neural network, employing Fast Attention Via positive Orthogonal Random features (FAVOR+), has demonstrated superior computational efficiency and scalability compared to the traditional Multi-head attention mechanism in Transformer models. Additionally, the integration of BiLSTM in the feedforward network enhances the model's capacity to capture temporal dynamics in the data, processing it in both forward and backward directions. This is particularly advantageous for time series data where past and future data points can influence the current state. The proposed method has been applied to the hourly and daily timeframes of the major cryptocurrencies and its performance has been benchmarked against other methods documented in the literature. The results underscore the potential of the proposed method to outperform existing models, marking a significant progression in the field of cryptocurrency price prediction.

  • 2 authors
·
Mar 6, 2024

An Analysis of Causal Effect Estimation using Outcome Invariant Data Augmentation

The technique of data augmentation (DA) is often used in machine learning for regularization purposes to better generalize under i.i.d. settings. In this work, we present a unifying framework with topics in causal inference to make a case for the use of DA beyond just the i.i.d. setting, but for generalization across interventions as well. Specifically, we argue that when the outcome generating mechanism is invariant to our choice of DA, then such augmentations can effectively be thought of as interventions on the treatment generating mechanism itself. This can potentially help to reduce bias in causal effect estimation arising from hidden confounders. In the presence of such unobserved confounding we typically make use of instrumental variables (IVs) -- sources of treatment randomization that are conditionally independent of the outcome. However, IVs may not be as readily available as DA for many applications, which is the main motivation behind this work. By appropriately regularizing IV based estimators, we introduce the concept of IV-like (IVL) regression for mitigating confounding bias and improving predictive performance across interventions even when certain IV properties are relaxed. Finally, we cast parameterized DA as an IVL regression problem and show that when used in composition can simulate a worst-case application of such DA, further improving performance on causal estimation and generalization tasks beyond what simple DA may offer. This is shown both theoretically for the population case and via simulation experiments for the finite sample case using a simple linear example. We also present real data experiments to support our case.

  • 5 authors
·
Oct 28, 2025 1

Effect Heterogeneity with Earth Observation in Randomized Controlled Trials: Exploring the Role of Data, Model, and Evaluation Metric Choice

Many social and environmental phenomena are associated with macroscopic changes in the built environment, captured by satellite imagery on a global scale and with daily temporal resolution. While widely used for prediction, these images and especially image sequences remain underutilized for causal inference, especially in the context of randomized controlled trials (RCTs), where causal identification is established by design. In this paper, we develop and compare a set of general tools for analyzing Conditional Average Treatment Effects (CATEs) from temporal satellite data that can be applied to any RCT where geographical identifiers are available. Through a simulation study, we analyze different modeling strategies for estimating CATE in sequences of satellite images. We find that image sequence representation models with more parameters generally yield a greater ability to detect heterogeneity. To explore the role of model and data choice in practice, we apply the approaches to two influential RCTs -- Banerjee et al. (2015), a poverty study in Cusco, Peru, and Bolsen et al. (2014), a water conservation experiment in Georgia, USA. We benchmark our image sequence models against image-only, tabular-only, and combined image-tabular data sources, summarizing practical implications for investigators in a multivariate analysis. Land cover classifications over satellite images facilitate interpretation of what image features drive heterogeneity. We also show robustness to data and model choice of satellite-based generalization of the RCT results to larger geographical areas outside the original. Overall, this paper shows how satellite sequence data can be incorporated into the analysis of RCTs, and provides evidence about the implications of data, model, and evaluation metric choice for causal analysis.

Integrating Earth Observation Data into Causal Inference: Challenges and Opportunities

Observational studies require adjustment for confounding factors that are correlated with both the treatment and outcome. In the setting where the observed variables are tabular quantities such as average income in a neighborhood, tools have been developed for addressing such confounding. However, in many parts of the developing world, features about local communities may be scarce. In this context, satellite imagery can play an important role, serving as a proxy for the confounding variables otherwise unobserved. In this paper, we study confounder adjustment in this non-tabular setting, where patterns or objects found in satellite images contribute to the confounder bias. Using the evaluation of anti-poverty aid programs in Africa as our running example, we formalize the challenge of performing causal adjustment with such unstructured data -- what conditions are sufficient to identify causal effects, how to perform estimation, and how to quantify the ways in which certain aspects of the unstructured image object are most predictive of the treatment decision. Via simulation, we also explore the sensitivity of satellite image-based observational inference to image resolution and to misspecification of the image-associated confounder. Finally, we apply these tools in estimating the effect of anti-poverty interventions in African communities from satellite imagery.

Selective Machine Learning of the Average Treatment Effect with an Invalid Instrumental Variable

Instrumental variable methods have been widely used to identify causal effects in the presence of unmeasured confounding. A key identification condition known as the exclusion restriction states that the instrument cannot have a direct effect on the outcome which is not mediated by the exposure in view. In the health and social sciences, such an assumption is often not credible. To address this concern, we consider identification conditions of the population average treatment effect with an invalid instrumental variable which does not satisfy the exclusion restriction, and derive the efficient influence function targeting the identifying functional under a nonparametric observed data model. We propose a novel multiply robust locally efficient estimator of the average treatment effect that is consistent in the union of multiple parametric nuisance models, as well as a multiply debiased machine learning estimator for which the nuisance parameters are estimated using generic machine learning methods, that effectively exploit various forms of linear or nonlinear structured sparsity in the nuisance parameter space. When one cannot be confident that any of these machine learners is consistent at sufficiently fast rates to ensure n-consistency for the average treatment effect, we introduce a new criteria for selective machine learning which leverages the multiple robustness property in order to ensure small bias. The proposed methods are illustrated through extensive simulations and a data analysis evaluating the causal effect of 401(k) participation on savings.

  • 3 authors
·
Jul 27, 2019

Partial Correlations in Compositional Data Analysis

Partial correlations quantify linear association between two variables adjusting for the influence of the remaining variables. They form the backbone for graphical models and are readily obtained from the inverse of the covariance matrix. For compositional data, the covariance structure is specified from log ratios of variables, so unless we try to "open" the data via a normalization, this implies changes in the definition and interpretation of partial correlations. In the present work, we elucidate how results derived by Aitchison (1986) lead to a natural definition of partial correlation that has a number of advantages over current measures of association. For this, we show that the residuals of log-ratios between a variable with a reference, when adjusting for all remaining variables including the reference, are reference-independent. Since the reference itself can be controlled for, correlations between residuals are defined for the variables directly without the necessity to recur to ratios except when specifying which variables are partialled out. Thus, perhaps surprisingly, partial correlations do not have the problems commonly found with measures of pairwise association on compositional data. They are well-defined between two variables, are properly scaled, and allow for negative association. By design, they are subcompositionally incoherent, but they share this property with conventional partial correlations (where results change when adjusting for the influence of fewer variables). We discuss the equivalence with normalization-based approaches whenever the normalizing variables are controlled for. We also discuss the partial variances and correlations we obtain from a previously studied data set of Roman glass cups.

  • 1 authors
·
Apr 20, 2019

Probabilistic AutoRegressive Neural Networks for Accurate Long-range Forecasting

Forecasting time series data is a critical area of research with applications spanning from stock prices to early epidemic prediction. While numerous statistical and machine learning methods have been proposed, real-life prediction problems often require hybrid solutions that bridge classical forecasting approaches and modern neural network models. In this study, we introduce the Probabilistic AutoRegressive Neural Networks (PARNN), capable of handling complex time series data exhibiting non-stationarity, nonlinearity, non-seasonality, long-range dependence, and chaotic patterns. PARNN is constructed by improving autoregressive neural networks (ARNN) using autoregressive integrated moving average (ARIMA) feedback error, combining the explainability, scalability, and "white-box-like" prediction behavior of both models. Notably, the PARNN model provides uncertainty quantification through prediction intervals, setting it apart from advanced deep learning tools. Through comprehensive computational experiments, we evaluate the performance of PARNN against standard statistical, machine learning, and deep learning models, including Transformers, NBeats, and DeepAR. Diverse real-world datasets from macroeconomics, tourism, epidemiology, and other domains are employed for short-term, medium-term, and long-term forecasting evaluations. Our results demonstrate the superiority of PARNN across various forecast horizons, surpassing the state-of-the-art forecasters. The proposed PARNN model offers a valuable hybrid solution for accurate long-range forecasting. By effectively capturing the complexities present in time series data, it outperforms existing methods in terms of accuracy and reliability. The ability to quantify uncertainty through prediction intervals further enhances the model's usefulness in decision-making processes.

  • 4 authors
·
Apr 1, 2022

Learning Interactions Between Continuous Treatments and Covariates with a Semiparametric Model

Estimating the impact of continuous treatment variables (e.g., dosage amount) on binary outcomes presents significant challenges in modeling and estimation because many existing approaches make strong assumptions that do not hold for certain continuous treatment variables. For instance, traditional logistic regression makes strong linearity assumptions that do not hold for continuous treatment variables like time of initiation. In this work, we propose a semiparametric regression framework that decomposes effects into two interpretable components: a prognostic score that captures baseline outcome risk based on a combination of clinical, genetic, and sociodemographic features, and a treatment-interaction score that flexibly models the optimal treatment level via a nonparametric link function. By connecting these two parametric scores with Nadaraya-Watson regression, our approach is both interpretable and flexible. The potential of our approach is demonstrated through numerical simulations that show empirical estimation convergence. We conclude by applying our approach to a real-world case study using the International Warfarin Pharmacogenomics Consortium (IWPC) dataset to show our approach's clinical utility by deriving personalized warfarin dosing recommendations that integrate both genetic and clinical data, providing insights towards enhancing patient safety and therapeutic efficacy in anticoagulation therapy.

  • 3 authors
·
May 6, 2025

Benchmark Datasets for Lead-Lag Forecasting on Social Platforms

Social and collaborative platforms emit multivariate time-series traces in which early interactions-such as views, likes, or downloads-are followed, sometimes months or years later, by higher impact like citations, sales, or reviews. We formalize this setting as Lead-Lag Forecasting (LLF): given an early usage channel (the lead), predict a correlated but temporally shifted outcome channel (the lag). Despite the ubiquity of such patterns, LLF has not been treated as a unified forecasting problem within the time-series community, largely due to the absence of standardized datasets. To anchor research in LLF, here we present two high-volume benchmark datasets-arXiv (accesses -> citations of 2.3M papers) and GitHub (pushes/stars -> forks of 3M repositories)-and outline additional domains with analogous lead-lag dynamics, including Wikipedia (page views -> edits), Spotify (streams -> concert attendance), e-commerce (click-throughs -> purchases), and LinkedIn profile (views -> messages). Our datasets provide ideal testbeds for lead-lag forecasting, by capturing long-horizon dynamics across years, spanning the full spectrum of outcomes, and avoiding survivorship bias in sampling. We documented all technical details of data curation and cleaning, verified the presence of lead-lag dynamics through statistical and classification tests, and benchmarked parametric and non-parametric baselines for regression. Our study establishes LLF as a novel forecasting paradigm and lays an empirical foundation for its systematic exploration in social and usage data. Our data portal with downloads and documentation is available at https://lead-lag-forecasting.github.io/.

  • 12 authors
·
Nov 5, 2025

Classification of BCI-EEG based on augmented covariance matrix

Objective: Electroencephalography signals are recorded as a multidimensional dataset. We propose a new framework based on the augmented covariance extracted from an autoregressive model to improve motor imagery classification. Methods: From the autoregressive model can be derived the Yule-Walker equations, which show the emergence of a symmetric positive definite matrix: the augmented covariance matrix. The state-of the art for classifying covariance matrices is based on Riemannian Geometry. A fairly natural idea is therefore to extend the standard approach using these augmented covariance matrices. The methodology for creating the augmented covariance matrix shows a natural connection with the delay embedding theorem proposed by Takens for dynamical systems. Such an embedding method is based on the knowledge of two parameters: the delay and the embedding dimension, respectively related to the lag and the order of the autoregressive model. This approach provides new methods to compute the hyper-parameters in addition to standard grid search. Results: The augmented covariance matrix performed noticeably better than any state-of-the-art methods. We will test our approach on several datasets and several subjects using the MOABB framework, using both within-session and cross-session evaluation. Conclusion: The improvement in results is due to the fact that the augmented covariance matrix incorporates not only spatial but also temporal information, incorporating nonlinear components of the signal through an embedding procedure, which allows the leveraging of dynamical systems algorithms. Significance: These results extend the concepts and the results of the Riemannian distance based classification algorithm.

  • 2 authors
·
Feb 9, 2023

Quantitative Risk Management in Volatile Markets with an Expectile-Based Framework for the FTSE Index

This research presents a framework for quantitative risk management in volatile markets, specifically focusing on expectile-based methodologies applied to the FTSE 100 index. Traditional risk measures such as Value-at-Risk (VaR) have demonstrated significant limitations during periods of market stress, as evidenced during the 2008 financial crisis and subsequent volatile periods. This study develops an advanced expectile-based framework that addresses the shortcomings of conventional quantile-based approaches by providing greater sensitivity to tail losses and improved stability in extreme market conditions. The research employs a dataset spanning two decades of FTSE 100 returns, incorporating periods of high volatility, market crashes, and recovery phases. Our methodology introduces novel mathematical formulations for expectile regression models, enhanced threshold determination techniques using time series analysis, and robust backtesting procedures. The empirical results demonstrate that expectile-based Value-at-Risk (EVaR) consistently outperforms traditional VaR measures across various confidence levels and market conditions. The framework exhibits superior performance during volatile periods, with reduced model risk and enhanced predictive accuracy. Furthermore, the study establishes practical implementation guidelines for financial institutions and provides evidence-based recommendations for regulatory compliance and portfolio management. The findings contribute significantly to the literature on financial risk management and offer practical tools for practitioners dealing with volatile market environments.

  • 1 authors
·
Jul 16, 2025 1

Image-based Treatment Effect Heterogeneity

Randomized controlled trials (RCTs) are considered the gold standard for estimating the average treatment effect (ATE) of interventions. One use of RCTs is to study the causes of global poverty -- a subject explicitly cited in the 2019 Nobel Memorial Prize awarded to Duflo, Banerjee, and Kremer "for their experimental approach to alleviating global poverty." Because the ATE is a population summary, anti-poverty experiments often seek to unpack the effect variation around the ATE by conditioning (CATE) on tabular variables such as age and ethnicity that were measured during the RCT data collection. Although such variables are key to unpacking CATE, using only such variables may fail to capture historical, geographical, or neighborhood-specific contributors to effect variation, as tabular RCT data are often only observed near the time of the experiment. In global poverty research, when the location of the experiment units is approximately known, satellite imagery can provide a window into such factors important for understanding heterogeneity. However, there is no method that specifically enables applied researchers to analyze CATE from images. In this paper, using a deep probabilistic modeling framework, we develop such a method that estimates latent clusters of images by identifying images with similar treatment effects distributions. Our interpretable image CATE model also includes a sensitivity factor that quantifies the importance of image segments contributing to the effect cluster prediction. We compare the proposed methods against alternatives in simulation; also, we show how the model works in an actual RCT, estimating the effects of an anti-poverty intervention in northern Uganda and obtaining a posterior predictive distribution over effects for the rest of the country where no experimental data was collected. We make all models available in open-source software.

Bootstrap aggregation and confidence measures to improve time series causal discovery

Learning causal graphs from multivariate time series is a ubiquitous challenge in all application domains dealing with time-dependent systems, such as in Earth sciences, biology, or engineering, to name a few. Recent developments for this causal discovery learning task have shown considerable skill, notably the specific time-series adaptations of the popular conditional independence-based learning framework. However, uncertainty estimation is challenging for conditional independence-based methods. Here, we introduce a novel bootstrap approach designed for time series causal discovery that preserves the temporal dependencies and lag structure. It can be combined with a range of time series causal discovery methods and provides a measure of confidence for the links of the time series graphs. Furthermore, next to confidence estimation, an aggregation, also called bagging, of the bootstrapped graphs by majority voting results in bagged causal discovery methods. In this work, we combine this approach with the state-of-the-art conditional-independence-based algorithm PCMCI+. With extensive numerical experiments we empirically demonstrate that, in addition to providing confidence measures for links, Bagged-PCMCI+ improves in precision and recall as compared to its base algorithm PCMCI+, at the cost of higher computational demands. These statistical performance improvements are especially pronounced in the more challenging settings (short time sample size, large number of variables, high autocorrelation). Our bootstrap approach can also be combined with other time series causal discovery algorithms and can be of considerable use in many real-world applications.

  • 4 authors
·
Jun 15, 2023

Debiasing Machine Learning Predictions for Causal Inference Without Additional Ground Truth Data: "One Map, Many Trials" in Satellite-Driven Poverty Analysis

Machine learning models trained on Earth observation data, such as satellite imagery, have demonstrated significant promise in predicting household-level wealth indices, enabling the creation of high-resolution wealth maps that can be leveraged across multiple causal trials. However, because standard training objectives prioritize overall predictive accuracy, these predictions inherently suffer from shrinkage toward the mean, leading to attenuated estimates of causal treatment effects and limiting their utility in policy. Existing debiasing methods, such as Prediction-Powered Inference, can handle this attenuation bias but require additional fresh ground-truth data at the downstream stage of causal inference, which restricts their applicability in data-scarce environments. Here, we introduce and evaluate two correction methods -- linear calibration correction and Tweedie's correction -- that substantially reduce prediction bias without relying on newly collected labeled data. Linear calibration corrects bias through a straightforward linear transformation derived from held-out calibration data, whereas Tweedie's correction leverages empirical Bayes principles to directly address shrinkage-induced biases by exploiting score functions derived from the model's learning patterns. Through analytical exercises and experiments using Demographic and Health Survey data, we demonstrate that the proposed methods meet or outperform existing approaches that either require (a) adjustments to training pipelines or (b) additional labeled data. These approaches may represent a promising avenue for improving the reliability of causal inference when direct outcome measures are limited or unavailable, enabling a "one map, many trials" paradigm where a single upstream data creation team produces predictions usable by many downstream teams across diverse ML pipelines.

Forging a Developed India: Growth Imperatives, Fiscal Sustainability, and Multilateral Partnerships for Viksit Bharat 2047

This paper examines the fiscal and macroeconomic strategies essential for transition of India to a high income economy by 2047, aligning with the vision of Viksit Bharat. A sustainable annual GDP growth rate of 7 to 8 percent is projected as necessary to achieve this milestone while maintaining fiscal prudence through a targeted deficit threshold below 3.5 percent of GDP. The study underscores the role of disciplined fiscal management in financing critical public investments in infrastructure, human capital development and technological innovation. Given constraints on domestic resource mobilization, the paper highlights the importance of multilateral financial institutions, including the World Bank, IMF and ADB, in expanding fiscal space in India through concessional financing, technical cooperation, and risk sharing mechanisms. Using econometric modeling and scenario analysis, the research identifies key policy interventions in infrastructure, healthcare, education and sustainable energy that can maximize growth while ensuring fiscal sustainability. Policy recommendations include enhancing tax buoyancy, rationalizing expenditure, optimizing public private partnerships and strengthening fiscal responsibility frameworks. The findings suggest that a calibrated approach to growth, prudent fiscal management and strategic international collaborations are critical to achieving long term economic aspirations of India.

  • 2 authors
·
Dec 1, 2025

Applying the Polynomial Maximization Method to Estimate ARIMA Models with Asymmetric Non-Gaussian Innovations

Classical estimators for ARIMA parameters (MLE, CSS, OLS) assume Gaussian innovations, an assumption frequently violated in financial and economic data exhibiting asymmetric distributions with heavy tails. We develop and validate the second-order polynomial maximization method (PMM2) for estimating ARIMA(p,d,q) models with non-Gaussian innovations. PMM2 is a semiparametric technique that exploits higher-order moments and cumulants without requiring full distributional specification. Monte Carlo experiments (128,000 simulations) across sample sizes N in {100, 200, 500, 1000} and four innovation distributions demonstrate that PMM2 substantially outperforms classical methods for asymmetric innovations. For ARIMA(1,1,0) with N=500, relative efficiency reaches 1.58--1.90 for Gamma, lognormal, and χ^2(3) innovations (37--47\% variance reduction). Under Gaussian innovations PMM2 matches OLS efficiency, avoiding the precision loss typical of robust estimators. The method delivers major gains for moderate asymmetry (|γ_3| geq 0.5) and N geq 200, with computational costs comparable to MLE. PMM2 provides an effective alternative for time series with asymmetric innovations typical of financial markets, macroeconomic indicators, and industrial measurements. Future extensions include seasonal SARIMA models, GARCH integration, and automatic order selection.

  • 1 authors
·
Nov 10, 2025 1

TimeXer: Empowering Transformers for Time Series Forecasting with Exogenous Variables

Deep models have demonstrated remarkable performance in time series forecasting. However, due to the partially-observed nature of real-world applications, solely focusing on the target of interest, so-called endogenous variables, is usually insufficient to guarantee accurate forecasting. Notably, a system is often recorded into multiple variables, where the exogenous variables can provide valuable external information for endogenous variables. Thus, unlike well-established multivariate or univariate forecasting paradigms that either treat all the variables equally or ignore exogenous information, this paper focuses on a more practical setting: time series forecasting with exogenous variables. We propose a novel approach, TimeXer, to ingest external information to enhance the forecasting of endogenous variables. With deftly designed embedding layers, TimeXer empowers the canonical Transformer with the ability to reconcile endogenous and exogenous information, where patch-wise self-attention and variate-wise cross-attention are used simultaneously. Moreover, global endogenous tokens are learned to effectively bridge the causal information underlying exogenous series into endogenous temporal patches. Experimentally, TimeXer achieves consistent state-of-the-art performance on twelve real-world forecasting benchmarks and exhibits notable generality and scalability. Code is available at this repository: https://github.com/thuml/TimeXer.

  • 9 authors
·
Feb 29, 2024

Revisiting Multivariate Time Series Forecasting with Missing Values

Missing values are common in real-world time series, and multivariate time series forecasting with missing values (MTSF-M) has become a crucial area of research for ensuring reliable predictions. To address the challenge of missing data, current approaches have developed an imputation-then-prediction framework that uses imputation modules to fill in missing values, followed by forecasting on the imputed data. However, this framework overlooks a critical issue: there is no ground truth for the missing values, making the imputation process susceptible to errors that can degrade prediction accuracy. In this paper, we conduct a systematic empirical study and reveal that imputation without direct supervision can corrupt the underlying data distribution and actively degrade prediction accuracy. To address this, we propose a paradigm shift that moves away from imputation and directly predicts from the partially observed time series. We introduce Consistency-Regularized Information Bottleneck (CRIB), a novel framework built on the Information Bottleneck principle. CRIB combines a unified-variate attention mechanism with a consistency regularization scheme to learn robust representations that filter out noise introduced by missing values while preserving essential predictive signals. Comprehensive experiments on four real-world datasets demonstrate the effectiveness of CRIB, which predicts accurately even under high missing rates. Our code is available in https://github.com/Muyiiiii/CRIB.

  • 7 authors
·
Sep 27, 2025

Causal de Finetti: On the Identification of Invariant Causal Structure in Exchangeable Data

Learning causal structure from observational data often assumes that we observe independent and identically distributed (i.\,i.\,d) data. The traditional approach aims to find a graphical representation that encodes the same set of conditional independence relationships as those present in the observed distribution. It is known that under i.\,i.\,d assumption, even with infinite data, there is a limit to how fine-grained a causal structure we can identify. To overcome this limitation, recent work has explored using data originating from different, related environments to learn richer causal structure. These approaches implicitly rely on the independent causal mechanisms (ICM) principle, which postulates that the mechanism giving rise to an effect given its causes and the mechanism which generates the causes do not inform or influence each other. Thus, components of the causal model can independently change from environment to environment. Despite its wide application in machine learning and causal inference, there is a lack of statistical formalization of the ICM principle and how it enables identification of richer causal structures from grouped data. Here we present new causal de Finetti theorems which offer a first statistical formalization of ICM principle and show how causal structure identification is possible from exchangeable data. Our work provides theoretical justification for a broad range of techniques leveraging multi-environment data to learn causal structure.

  • 4 authors
·
Mar 29, 2022

Environment-Adaptive Covariate Selection: Learning When to Use Spurious Correlations for Out-of-Distribution Prediction

Out-of-distribution (OOD) prediction is often approached by restricting models to causal or invariant covariates, avoiding non-causal spurious associations that may be unstable across environments. Despite its theoretical appeal, this strategy frequently underperforms empirical risk minimization (ERM) in practice. We investigate the source of this gap and show that such failures naturally arise when only a subset of the true causes of the outcome is observed. In these settings, non-causal spurious covariates can serve as informative proxies for unobserved causes and substantially improve prediction, except under distribution shifts that break these proxy relationships. Consequently, the optimal set of predictive covariates is neither universal nor necessarily exhibits invariant relationships with the outcome across all environments, but instead depends on the specific type of shift encountered. Crucially, we observe that different covariate shifts induce distinct, observable signatures in the covariate distribution itself. Moreover, these signatures can be extracted from unlabeled data in the target OOD environment and used to assess when proxy covariates remain reliable and when they fail. Building on this observation, we propose an environment-adaptive covariate selection (EACS) algorithm that maps environment-level covariate summaries to environment-specific covariate sets, while allowing the incorporation of prior causal knowledge as constraints. Across simulations and applied datasets, EACS consistently outperforms static causal, invariant, and ERM-based predictors under diverse distribution shifts.

  • 2 authors
·
Jan 5

Temporal Causal-based Simulation for Realistic Time-series Generation

Causal Discovery plays a pivotal role in revealing relationships among observed variables, particularly in the temporal setup. While the majority of CD methods rely on synthetic data for evaluation, and recently for training, these fall short in accurately mirroring real-world scenarios; an effect even more evident in temporal data. Generation techniques depending on simplified assumptions on causal structure, effects and time, limit the quality and diversity of the simulated data. In this work, we introduce Temporal Causal-based Simulation (TCS), a robust framework for generating realistic time-series data and their associated temporal causal graphs. The approach is structured in three phases: estimating the true lagged causal structure of the data, approximating the functional dependencies between variables and learning the noise distribution of the corresponding causal model, each part of which can be explicitly tailored based on data assumptions and characteristics. Through an extensive evaluation process, we highlight that single detection methods for generated data discrimination prove inadequate, accentuating it as a multifaceted challenge. For this, we detail a Min-max optimization phase that draws on AutoML techniques. Our contributions include a flexible, model-agnostic pipeline for generating realistic temporal causal data, a thorough evaluation setup which enhances the validity of the generated datasets and insights into the challenges posed by realistic data generation. Through experiments involving not only real but also semi-synthetic and purely synthetic datasets, we demonstrate that while sampling realistic causal data remains a complex task, our method enriches the domain of generating sensible causal-based temporal data.

  • 7 authors
·
Jun 2, 2025

A study of a deterministic model for meningitis epidemic

A compartmental deterministic model that allows (1) immunity from two stages of infection and carriage, and (2) disease induced death, is used in studying the dynamics of meningitis epidemic process in a closed population. It allows for difference in the transmission rate of infection to a susceptible by a carrier and an infective. It is generalized to allow a proportion ({\phi}) of those susceptibles infected to progress directly to infectives in stage I. Both models are used in this study. The threshold conditions for the spread of carrier and infectives in stage I are derived for the two models. Sensitivity analysis is performed on the reproductive number derived from the next generation matrix. The case-carrier ratio profile for various parameters and threshold values are shown. So also are the graphs of the total number ever infected as influenced by {\epsilon} and {\phi}. The infection transmission rate (eta), the odds in favor of a carrier, over an infective, in transmitting an infection to a susceptible ({\epsilon}) and the carrier conversion rate ({\phi}) to an infective in stage I, are identified as key parameters that should be subject of attention for any control intervention strategy. The case-carrier ratio profiles provide evidence of a critical case-carrier ratio attained before the number of reported cases grows to an epidemic level. They also provide visual evidence of epidemiological context, in this case, epidemic incidence (in later part of dry season) and endemic incidence (during rainy season). Results from total proportion ever infected suggest that the model, in which {\phi}=0 obtained, can adequately represent, in essence, the generalized model for this study.

  • 2 authors
·
Mar 31, 2023

ImDiffusion: Imputed Diffusion Models for Multivariate Time Series Anomaly Detection

Anomaly detection in multivariate time series data is of paramount importance for ensuring the efficient operation of large-scale systems across diverse domains. However, accurately detecting anomalies in such data poses significant challenges. Existing approaches, including forecasting and reconstruction-based methods, struggle to address these challenges effectively. To overcome these limitations, we propose a novel anomaly detection framework named ImDiffusion, which combines time series imputation and diffusion models to achieve accurate and robust anomaly detection. The imputation-based approach employed by ImDiffusion leverages the information from neighboring values in the time series, enabling precise modeling of temporal and inter-correlated dependencies, reducing uncertainty in the data, thereby enhancing the robustness of the anomaly detection process. ImDiffusion further leverages diffusion models as time series imputers to accurately capturing complex dependencies. We leverage the step-by-step denoised outputs generated during the inference process to serve as valuable signals for anomaly prediction, resulting in improved accuracy and robustness of the detection process. We evaluate the performance of ImDiffusion via extensive experiments on benchmark datasets. The results demonstrate that our proposed framework significantly outperforms state-of-the-art approaches in terms of detection accuracy and timeliness. ImDiffusion is further integrated into the real production system in Microsoft and observe a remarkable 11.4% increase in detection F1 score compared to the legacy approach. To the best of our knowledge, ImDiffusion represents a pioneering approach that combines imputation-based techniques with time series anomaly detection, while introducing the novel use of diffusion models to the field.

  • 10 authors
·
Jul 3, 2023

BASIR: Budget-Assisted Sectoral Impact Ranking -- A Dataset for Sector Identification and Performance Prediction Using Language Models

Government fiscal policies, particularly annual union budgets, exert significant influence on financial markets. However, real-time analysis of budgetary impacts on sector-specific equity performance remains methodologically challenging and largely unexplored. This study proposes a framework to systematically identify and rank sectors poised to benefit from India's Union Budget announcements. The framework addresses two core tasks: (1) multi-label classification of excerpts from budget transcripts into 81 predefined economic sectors, and (2) performance ranking of these sectors. Leveraging a comprehensive corpus of Indian Union Budget transcripts from 1947 to 2025, we introduce BASIR (Budget-Assisted Sectoral Impact Ranking), an annotated dataset mapping excerpts from budgetary transcripts to sectoral impacts. Our architecture incorporates fine-tuned embeddings for sector identification, coupled with language models that rank sectors based on their predicted performances. Our results demonstrate 0.605 F1-score in sector classification, and 0.997 NDCG score in predicting ranks of sectors based on post-budget performances. The methodology enables investors and policymakers to quantify fiscal policy impacts through structured, data-driven insights, addressing critical gaps in manual analysis. The annotated dataset has been released under CC-BY-NC-SA-4.0 license to advance computational economics research.

  • 2 authors
·
Apr 2, 2025