new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

May 7

The Limited Impact of Medical Adaptation of Large Language and Vision-Language Models

Several recent works seek to develop foundation models specifically for medical applications, adapting general-purpose large language models (LLMs) and vision-language models (VLMs) via continued pretraining on publicly available biomedical corpora. These works typically claim that such domain-adaptive pretraining (DAPT) improves performance on downstream medical tasks, such as answering medical licensing exam questions. In this paper, we compare ten public "medical" LLMs and two VLMs against their corresponding base models, arriving at a different conclusion: all medical VLMs and nearly all medical LLMs fail to consistently improve over their base models in the zero-/few-shot prompting and supervised fine-tuning regimes for medical question-answering (QA). For instance, across all tasks and model pairs we consider in the 3-shot setting, medical LLMs only outperform their base models in 22.7% of cases, reach a (statistical) tie in 36.8% of cases, and are significantly worse than their base models in the remaining 40.5% of cases. Our conclusions are based on (i) comparing each medical model head-to-head, directly against the corresponding base model; (ii) optimizing the prompts for each model separately in zero-/few-shot prompting; and (iii) accounting for statistical uncertainty in comparisons. While these basic practices are not consistently adopted in the literature, our ablations show that they substantially impact conclusions. Meanwhile, we find that after fine-tuning on specific QA tasks, medical LLMs can show performance improvements, but the benefits do not carry over to tasks based on clinical notes. Our findings suggest that state-of-the-art general-domain models may already exhibit strong medical knowledge and reasoning capabilities, and offer recommendations to strengthen the conclusions of future studies.

  • 5 authors
·
Nov 13, 2024

Beyond Binary Gender Labels: Revealing Gender Biases in LLMs through Gender-Neutral Name Predictions

Name-based gender prediction has traditionally categorized individuals as either female or male based on their names, using a binary classification system. That binary approach can be problematic in the cases of gender-neutral names that do not align with any one gender, among other reasons. Relying solely on binary gender categories without recognizing gender-neutral names can reduce the inclusiveness of gender prediction tasks. We introduce an additional gender category, i.e., "neutral", to study and address potential gender biases in Large Language Models (LLMs). We evaluate the performance of several foundational and large language models in predicting gender based on first names only. Additionally, we investigate the impact of adding birth years to enhance the accuracy of gender prediction, accounting for shifting associations between names and genders over time. Our findings indicate that most LLMs identify male and female names with high accuracy (over 80%) but struggle with gender-neutral names (under 40%), and the accuracy of gender prediction is higher for English-based first names than non-English names. The experimental results show that incorporating the birth year does not improve the overall accuracy of gender prediction, especially for names with evolving gender associations. We recommend using caution when applying LLMs for gender identification in downstream tasks, particularly when dealing with non-binary gender labels.

  • 7 authors
·
Jul 7, 2024

LinEAS: End-to-end Learning of Activation Steering with a Distributional Loss

The growing use of generative models in daily life calls for efficient mechanisms to control their generation, to e.g., produce safe content or provide users with tools to explore style changes. Ideally, such mechanisms should require low volume of unpaired data (i.e., without explicit preference), and should be cheap, both at train and inference time, while preserving output quality. Recent research has shown that such mechanisms can be obtained by intervening exclusively on model activations, with the goal of correcting distributional differences between activations seen when using prompts from a source vs. a target set (e.g., toxic and non-toxic sentences). While cheap, these fast methods are inherently crude: their maps are tuned locally, not accounting for their impact on downstream layers, resulting in interventions that cause unintended shifts when used out-of-sample. We propose in this work linear end-to-end activation steering (LinEAS), an approach trained with a global loss that accounts simultaneously for all layer-wise distributional shifts. In addition to being more robust, the loss used to train LinEAS can be regularized with sparsifying norms, which can automatically carry out neuron selection. LinEAS only requires a handful of unpaired samples to be effective, and beats similar baselines on toxicity mitigation in language models, becoming competitive with oracle-dependent methods that have access to strong supervision. LinEAS is modality-agnostic and we empirically find that it outperforms existing activation steering methods at mitigating and including new concepts at the output of single-step text-to-image generation models.

apple Apple
·
Mar 11, 2025 1

APLA: Additional Perturbation for Latent Noise with Adversarial Training Enables Consistency

Diffusion models have exhibited promising progress in video generation. However, they often struggle to retain consistent details within local regions across frames. One underlying cause is that traditional diffusion models approximate Gaussian noise distribution by utilizing predictive noise, without fully accounting for the impact of inherent information within the input itself. Additionally, these models emphasize the distinction between predictions and references, neglecting information intrinsic to the videos. To address this limitation, inspired by the self-attention mechanism, we propose a novel text-to-video (T2V) generation network structure based on diffusion models, dubbed Additional Perturbation for Latent noise with Adversarial training (APLA). Our approach only necessitates a single video as input and builds upon pre-trained stable diffusion networks. Notably, we introduce an additional compact network, known as the Video Generation Transformer (VGT). This auxiliary component is designed to extract perturbations from the inherent information contained within the input, thereby refining inconsistent pixels during temporal predictions. We leverage a hybrid architecture of transformers and convolutions to compensate for temporal intricacies, enhancing consistency between different frames within the video. Experiments demonstrate a noticeable improvement in the consistency of the generated videos both qualitatively and quantitatively.

  • 5 authors
·
Aug 24, 2023

Bridging the Gap: Integrating Ethics and Environmental Sustainability in AI Research and Practice

As the possibilities for Artificial Intelligence (AI) have grown, so have concerns regarding its impacts on society and the environment. However, these issues are often raised separately; i.e. carbon footprint analyses of AI models typically do not consider how the pursuit of scale has contributed towards building models that are both inaccessible to most researchers in terms of cost and disproportionately harmful to the environment. On the other hand, model audits that aim to evaluate model performance and disparate impacts mostly fail to engage with the environmental ramifications of AI models and how these fit into their auditing approaches. In this separation, both research directions fail to capture the depth of analysis that can be explored by considering the two in parallel and the potential solutions for making informed choices that can be developed at their convergence. In this essay, we build upon work carried out in AI and in sister communities, such as philosophy and sustainable development, to make more deliberate connections around topics such as generalizability, transparency, evaluation and equity across AI research and practice. We argue that the efforts aiming to study AI's ethical ramifications should be made in tandem with those evaluating its impacts on the environment, and we conclude with a proposal of best practices to better integrate AI ethics and sustainability in AI research and practice.

  • 4 authors
·
Apr 1, 2025

Empirical Study of Market Impact Conditional on Order-Flow Imbalance

In this research, we have empirically investigated the key drivers affecting liquidity in equity markets. We illustrated how theoretical models, such as Kyle's model, of agents' interplay in the financial markets, are aligned with the phenomena observed in publicly available trades and quotes data. Specifically, we confirmed that for small signed order-flows, the price impact grows linearly with increase in the order-flow imbalance. We have, further, implemented a machine learning algorithm to forecast market impact given a signed order-flow. Our findings suggest that machine learning models can be used in estimation of financial variables; and predictive accuracy of such learning algorithms can surpass the performance of traditional statistical approaches. Understanding the determinants of price impact is crucial for several reasons. From a theoretical stance, modelling the impact provides a statistical measure of liquidity. Practitioners adopt impact models as a pre-trade tool to estimate expected transaction costs and optimize the execution of their strategies. This further serves as a post-trade valuation benchmark as suboptimal execution can significantly deteriorate a portfolio performance. More broadly, the price impact reflects the balance of liquidity across markets. This is of central importance to regulators as it provides an all-encompassing explanation of the correlation between market design and systemic risk, enabling regulators to design more stable and efficient markets.

  • 1 authors
·
Apr 17, 2020