new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

Apr 17

SSL4Eco: A Global Seasonal Dataset for Geospatial Foundation Models in Ecology

With the exacerbation of the biodiversity and climate crises, macroecological pursuits such as global biodiversity mapping become more urgent. Remote sensing offers a wealth of Earth observation data for ecological studies, but the scarcity of labeled datasets remains a major challenge. Recently, self-supervised learning has enabled learning representations from unlabeled data, triggering the development of pretrained geospatial models with generalizable features. However, these models are often trained on datasets biased toward areas of high human activity, leaving entire ecological regions underrepresented. Additionally, while some datasets attempt to address seasonality through multi-date imagery, they typically follow calendar seasons rather than local phenological cycles. To better capture vegetation seasonality at a global scale, we propose a simple phenology-informed sampling strategy and introduce corresponding SSL4Eco, a multi-date Sentinel-2 dataset, on which we train an existing model with a season-contrastive objective. We compare representations learned from SSL4Eco against other datasets on diverse ecological downstream tasks and demonstrate that our straightforward sampling method consistently improves representation quality, highlighting the importance of dataset construction. The model pretrained on SSL4Eco reaches state of the art performance on 7 out of 8 downstream tasks spanning (multi-label) classification and regression. We release our code, data, and model weights to support macroecological and computer vision research at https://github.com/PlekhanovaElena/ssl4eco.

  • 7 authors
·
Apr 25, 2025

BioAnalyst: A Foundation Model for Biodiversity

The accelerating loss of biodiversity presents critical challenges for ecological research and conservation strategies. The preservation of biodiversity is paramount for maintaining ecological balance and ensuring the sustainability of ecosystems. However, biodiversity faces numerous threats, including habitat loss, climate change, and the proliferation of invasive species. Addressing these and other ecology-related challenges, both at local and global scales, requires comprehensive monitoring, predictive and conservation planning capabilities. Artificial Intelligence (AI) Foundation Models (FMs) have gained significant momentum in numerous scientific domains by leveraging vast datasets to learn general-purpose representations adaptable to various downstream tasks. This paradigm holds immense promise for biodiversity conservation. In response, we introduce BioAnalyst, the first Foundation Model tailored for biodiversity analysis and conservation planning. BioAnalyst employs a transformer-based architecture, pre-trained on extensive multi-modal datasets encompassing species occurrence records, remote sensing indicators, climate and environmental variables. BioAnalyst is designed for adaptability, allowing for fine-tuning of a range of downstream tasks, such as species distribution modelling, habitat suitability assessments, invasive species detection, and population trend forecasting. We evaluate the model's performance on two downstream use cases, demonstrating its generalisability compared to existing methods, particularly in data-scarce scenarios for two distinct use-cases, establishing a new accuracy baseline for ecological forecasting. By openly releasing BioAnalyst and its fine-tuning workflows to the scientific community, we aim to foster collaborative efforts in biodiversity modelling and advance AI-driven solutions to pressing ecological challenges.

  • 7 authors
·
Jul 11, 2025

Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive?

Predictable behavior from scaling advanced AI systems is an extremely desirable property. Although a well-established literature exists on how pretraining performance scales, the literature on how particular downstream capabilities scale is significantly muddier. In this work, we take a step back and ask: why has predicting specific downstream capabilities with scale remained elusive? While many factors are certainly responsible, we identify a new factor that makes modeling scaling behavior on widely used multiple-choice question-answering benchmarks challenging. Using five model families and twelve well-established multiple-choice benchmarks, we show that downstream performance is computed from negative log likelihoods via a sequence of transformations that progressively degrade the statistical relationship between performance and scale. We then reveal the mechanism causing this degradation: downstream metrics require comparing the correct choice against a small number of specific incorrect choices, meaning accurately predicting downstream capabilities requires predicting not just how probability mass concentrates on the correct choice with scale, but also how probability mass fluctuates on specific incorrect choices with scale. We empirically study how probability mass on the correct choice co-varies with probability mass on incorrect choices with increasing compute, suggesting that scaling laws for incorrect choices might be achievable. Our work also explains why pretraining scaling laws are commonly regarded as more predictable than downstream capabilities and contributes towards establishing scaling-predictable evaluations of frontier AI models.

  • 9 authors
·
Jun 6, 2024

EcoVerse: An Annotated Twitter Dataset for Eco-Relevance Classification, Environmental Impact Analysis, and Stance Detection

Anthropogenic ecological crisis constitutes a significant challenge that all within the academy must urgently face, including the Natural Language Processing (NLP) community. While recent years have seen increasing work revolving around climate-centric discourse, crucial environmental and ecological topics outside of climate change remain largely unaddressed, despite their prominent importance. Mainstream NLP tasks, such as sentiment analysis, dominate the scene, but there remains an untouched space in the literature involving the analysis of environmental impacts of certain events and practices. To address this gap, this paper presents EcoVerse, an annotated English Twitter dataset of 3,023 tweets spanning a wide spectrum of environmental topics. We propose a three-level annotation scheme designed for Eco-Relevance Classification, Stance Detection, and introducing an original approach for Environmental Impact Analysis. We detail the data collection, filtering, and labeling process that led to the creation of the dataset. Remarkable Inter-Annotator Agreement indicates that the annotation scheme produces consistent annotations of high quality. Subsequent classification experiments using BERT-based models, including ClimateBERT, are presented. These yield encouraging results, while also indicating room for a model specifically tailored for environmental texts. The dataset is made freely available to stimulate further research.

  • 4 authors
·
Apr 7, 2024

Causal Attribution of Coastal Water Clarity Degradation to Nickel Processing Expansion at the Indonesia Morowali Industrial Park, Sulawesi

Indonesia's nickel ore export ban has driven rapid expansion of smelting and hydrometallurgical processing capacity at the Indonesia Morowali Industrial Park (IMIP), now the world's largest integrated nickel processing complex, on the coast of Central Sulawesi. Whether this industrialization has degraded the adjacent marine environment remains unquantified. We apply Bayesian structural time-series (BSTS) causal inference to a multi-decadal, multi-sensor satellite ocean color record of the diffuse attenuation coefficient at 490 nm, K_d(490), to test for a causal link between IMIP expansion and nearshore turbidity change. A consensus structural breakpoint, a significant posterior causal effect estimated against a Banda Sea counterfactual, and a distribution-free placebo rank test collectively establish that coastal water clarity deteriorated after the transition from initial nickel pig iron production to hyper-expansion of high-pressure acid leaching facilities for battery-grade nickel. Satellite-derived land cover analysis independently corroborates this timing, showing substantial built-area growth and concurrent tree cover loss within the IMIP footprint. The resulting euphotic zone shoaling occurs in oligotrophic waters supporting high marine biodiversity, where even moderate optical degradation may impair coral photosynthesis and compress depth-dependent reef habitat. These findings quantify a marine environmental cost absent from Indonesia's mineral downstreaming policy discourse and demonstrate a transferable, satellite-based quasi-experimental framework for causal impact assessment at coastal industrial sites in data-limited tropical settings.

Provable Benefits of Multi-task RL under Non-Markovian Decision Making Processes

In multi-task reinforcement learning (RL) under Markov decision processes (MDPs), the presence of shared latent structures among multiple MDPs has been shown to yield significant benefits to the sample efficiency compared to single-task RL. In this paper, we investigate whether such a benefit can extend to more general sequential decision making problems, such as partially observable MDPs (POMDPs) and more general predictive state representations (PSRs). The main challenge here is that the large and complex model space makes it hard to identify what types of common latent structure of multi-task PSRs can reduce the model complexity and improve sample efficiency. To this end, we posit a joint model class for tasks and use the notion of eta-bracketing number to quantify its complexity; this number also serves as a general metric to capture the similarity of tasks and thus determines the benefit of multi-task over single-task RL. We first study upstream multi-task learning over PSRs, in which all tasks share the same observation and action spaces. We propose a provably efficient algorithm UMT-PSR for finding near-optimal policies for all PSRs, and demonstrate that the advantage of multi-task learning manifests if the joint model class of PSRs has a smaller eta-bracketing number compared to that of individual single-task learning. We also provide several example multi-task PSRs with small eta-bracketing numbers, which reap the benefits of multi-task learning. We further investigate downstream learning, in which the agent needs to learn a new target task that shares some commonalities with the upstream tasks via a similarity constraint. By exploiting the learned PSRs from the upstream, we develop a sample-efficient algorithm that provably finds a near-optimal policy.

  • 5 authors
·
Oct 20, 2023

HIVEX: A High-Impact Environment Suite for Multi-Agent Research (extended version)

Games have been vital test beds for the rapid development of Agent-based research. Remarkable progress has been achieved in the past, but it is unclear if the findings equip for real-world problems. While pressure grows, some of the most critical ecological challenges can find mitigation and prevention solutions through technology and its applications. Most real-world domains include multi-agent scenarios and require machine-machine and human-machine collaboration. Open-source environments have not advanced and are often toy scenarios, too abstract or not suitable for multi-agent research. By mimicking real-world problems and increasing the complexity of environments, we hope to advance state-of-the-art multi-agent research and inspire researchers to work on immediate real-world problems. Here, we present HIVEX, an environment suite to benchmark multi-agent research focusing on ecological challenges. HIVEX includes the following environments: Wind Farm Control, Wildfire Resource Management, Drone-Based Reforestation, Ocean Plastic Collection, and Aerial Wildfire Suppression. We provide environments, training examples, and baselines for the main and sub-tasks. All trained models resulting from the experiments of this work are hosted on Hugging Face. We also provide a leaderboard on Hugging Face and encourage the community to submit models trained on our environment suite.

  • 1 authors
·
Jan 7, 2025

The MineRL BASALT Competition on Learning from Human Feedback

The last decade has seen a significant increase of interest in deep learning research, with many public successes that have demonstrated its potential. As such, these systems are now being incorporated into commercial products. With this comes an additional challenge: how can we build AI systems that solve tasks where there is not a crisp, well-defined specification? While multiple solutions have been proposed, in this competition we focus on one in particular: learning from human feedback. Rather than training AI systems using a predefined reward function or using a labeled dataset with a predefined set of categories, we instead train the AI system using a learning signal derived from some form of human feedback, which can evolve over time as the understanding of the task changes, or as the capabilities of the AI system improve. The MineRL BASALT competition aims to spur forward research on this important class of techniques. We design a suite of four tasks in Minecraft for which we expect it will be hard to write down hardcoded reward functions. These tasks are defined by a paragraph of natural language: for example, "create a waterfall and take a scenic picture of it", with additional clarifying details. Participants must train a separate agent for each task, using any method they want. Agents are then evaluated by humans who have read the task description. To help participants get started, we provide a dataset of human demonstrations on each of the four tasks, as well as an imitation learning baseline that leverages these demonstrations. Our hope is that this competition will improve our ability to build AI systems that do what their designers intend them to do, even when the intent cannot be easily formalized. Besides allowing AI to solve more tasks, this can also enable more effective regulation of AI systems, as well as making progress on the value alignment problem.

  • 13 authors
·
Jul 5, 2021

WHOI-Plankton- A Large Scale Fine Grained Visual Recognition Benchmark Dataset for Plankton Classification

Planktonic organisms are of fundamental importance to marine ecosystems: they form the basis of the food web, provide the link between the atmosphere and the deep ocean, and influence global-scale biogeochemical cycles. Scientists are increasingly using imaging-based technologies to study these creatures in their natural habit. Images from such systems provide an unique opportunity to model and understand plankton ecosystems, but the collected datasets can be enormous. The Imaging FlowCytobot (IFCB) at Woods Hole Oceanographic Institution, for example, is an in situ system that has been continuously imaging plankton since 2006. To date, it has generated more than 700 million samples. Manual classification of such a vast image collection is impractical due to the size of the data set. In addition, the annotation task is challenging due to the large space of relevant classes, intra-class variability, and inter-class similarity. Methods for automated classification exist, but the accuracy is often below that of human experts. Here we introduce WHOI-Plankton: a large scale, fine-grained visual recognition dataset for plankton classification, which comprises over 3.4 million expert-labeled images across 70 classes. The labeled image set is complied from over 8 years of near continuous data collection with the IFCB at the Martha's Vineyard Coastal Observatory (MVCO). We discuss relevant metrics for evaluation of classification performance and provide results for a traditional method based on hand-engineered features and two methods based on convolutional neural networks.

  • 4 authors
·
Oct 2, 2015

TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs

Artificial Intelligence (AI) has made incredible progress recently. On the one hand, advanced foundation models like ChatGPT can offer powerful conversation, in-context learning and code generation abilities on a broad range of open-domain tasks. They can also generate high-level solution outlines for domain-specific tasks based on the common sense knowledge they have acquired. However, they still face difficulties with some specialized tasks because they lack enough domain-specific data during pre-training or they often have errors in their neural network computations on those tasks that need accurate executions. On the other hand, there are also many existing models and systems (symbolic-based or neural-based) that can do some domain-specific tasks very well. However, due to the different implementation or working mechanisms, they are not easily accessible or compatible with foundation models. Therefore, there is a clear and pressing need for a mechanism that can leverage foundation models to propose task solution outlines and then automatically match some of the sub-tasks in the outlines to the off-the-shelf models and systems with special functionalities to complete them. Inspired by this, we introduce TaskMatrix.AI as a new AI ecosystem that connects foundation models with millions of APIs for task completion. Unlike most previous work that aimed to improve a single AI model, TaskMatrix.AI focuses more on using existing foundation models (as a brain-like central system) and APIs of other AI models and systems (as sub-task solvers) to achieve diversified tasks in both digital and physical domains. As a position paper, we will present our vision of how to build such an ecosystem, explain each key component, and use study cases to illustrate both the feasibility of this vision and the main challenges we need to address next.

  • 14 authors
·
Mar 28, 2023

Learning to Communicate and Collaborate in a Competitive Multi-Agent Setup to Clean the Ocean from Macroplastics

Finding a balance between collaboration and competition is crucial for artificial agents in many real-world applications. We investigate this using a Multi-Agent Reinforcement Learning (MARL) setup on the back of a high-impact problem. The accumulation and yearly growth of plastic in the ocean cause irreparable damage to many aspects of oceanic health and the marina system. To prevent further damage, we need to find ways to reduce macroplastics from known plastic patches in the ocean. Here we propose a Graph Neural Network (GNN) based communication mechanism that increases the agents' observation space. In our custom environment, agents control a plastic collecting vessel. The communication mechanism enables agents to develop a communication protocol using a binary signal. While the goal of the agent collective is to clean up as much as possible, agents are rewarded for the individual amount of macroplastics collected. Hence agents have to learn to communicate effectively while maintaining high individual performance. We compare our proposed communication mechanism with a multi-agent baseline without the ability to communicate. Results show communication enables collaboration and increases collective performance significantly. This means agents have learned the importance of communication and found a balance between collaboration and competition.

  • 1 authors
·
Nov 5, 2024

A game-theoretic analysis of networked system control for common-pool resource management using multi-agent reinforcement learning

Multi-agent reinforcement learning has recently shown great promise as an approach to networked system control. Arguably, one of the most difficult and important tasks for which large scale networked system control is applicable is common-pool resource management. Crucial common-pool resources include arable land, fresh water, wetlands, wildlife, fish stock, forests and the atmosphere, of which proper management is related to some of society's greatest challenges such as food security, inequality and climate change. Here we take inspiration from a recent research program investigating the game-theoretic incentives of humans in social dilemma situations such as the well-known tragedy of the commons. However, instead of focusing on biologically evolved human-like agents, our concern is rather to better understand the learning and operating behaviour of engineered networked systems comprising general-purpose reinforcement learning agents, subject only to nonbiological constraints such as memory, computation and communication bandwidth. Harnessing tools from empirical game-theoretic analysis, we analyse the differences in resulting solution concepts that stem from employing different information structures in the design of networked multi-agent systems. These information structures pertain to the type of information shared between agents as well as the employed communication protocol and network topology. Our analysis contributes new insights into the consequences associated with certain design choices and provides an additional dimension of comparison between systems beyond efficiency, robustness, scalability and mean control performance.

  • 9 authors
·
Oct 15, 2020

LLM as Dataset Analyst: Subpopulation Structure Discovery with Large Language Model

The distribution of subpopulations is an important property hidden within a dataset. Uncovering and analyzing the subpopulation distribution within datasets provides a comprehensive understanding of the datasets, standing as a powerful tool beneficial to various downstream tasks, including Dataset Subpopulation Organization, Subpopulation Shift, and Slice Discovery. Despite its importance, there has been no work that systematically explores the subpopulation distribution of datasets to our knowledge. To address the limitation and solve all the mentioned tasks in a unified way, we introduce a novel concept of subpopulation structures to represent, analyze, and utilize subpopulation distributions within datasets. To characterize the structures in an interpretable manner, we propose the Subpopulation Structure Discovery with Large Language Models (SSD-LLM) framework, which employs world knowledge and instruction-following capabilities of Large Language Models (LLMs) to linguistically analyze informative image captions and summarize the structures. Furthermore, we propose complete workflows to address downstream tasks, named Task-specific Tuning, showcasing the application of the discovered structure to a spectrum of subpopulation-related tasks, including dataset subpopulation organization, subpopulation shift, and slice discovery. Furthermore, we propose complete workflows to address downstream tasks, named Task-specific Tuning, showcasing the application of the discovered structure to a spectrum of subpopulation-related tasks, including dataset subpopulation organization, subpopulation shift, and slice discovery.

  • 6 authors
·
May 3, 2024

Understanding accountability in algorithmic supply chains

Academic and policy proposals on algorithmic accountability often seek to understand algorithmic systems in their socio-technical context, recognising that they are produced by 'many hands'. Increasingly, however, algorithmic systems are also produced, deployed, and used within a supply chain comprising multiple actors tied together by flows of data between them. In such cases, it is the working together of an algorithmic supply chain of different actors who contribute to the production, deployment, use, and functionality that drives systems and produces particular outcomes. We argue that algorithmic accountability discussions must consider supply chains and the difficult implications they raise for the governance and accountability of algorithmic systems. In doing so, we explore algorithmic supply chains, locating them in their broader technical and political economic context and identifying some key features that should be understood in future work on algorithmic governance and accountability (particularly regarding general purpose AI services). To highlight ways forward and areas warranting attention, we further discuss some implications raised by supply chains: challenges for allocating accountability stemming from distributed responsibility for systems between actors, limited visibility due to the accountability horizon, service models of use and liability, and cross-border supply chains and regulatory arbitrage

  • 3 authors
·
Apr 28, 2023

Exploring the sustainable scaling of AI dilemma: A projective study of corporations' AI environmental impacts

The rapid growth of artificial intelligence (AI), particularly Large Language Models (LLMs), has raised concerns regarding its global environmental impact that extends beyond greenhouse gas emissions to include consideration of hardware fabrication and end-of-life processes. The opacity from major providers hinders companies' abilities to evaluate their AI-related environmental impacts and achieve net-zero targets. In this paper, we propose a methodology to estimate the environmental impact of a company's AI portfolio, providing actionable insights without necessitating extensive AI and Life-Cycle Assessment (LCA) expertise. Results confirm that large generative AI models consume up to 4600x more energy than traditional models. Our modelling approach, which accounts for increased AI usage, hardware computing efficiency, and changes in electricity mix in line with IPCC scenarios, forecasts AI electricity use up to 2030. Under a high adoption scenario, driven by widespread Generative AI and agents adoption associated to increasingly complex models and frameworks, AI electricity use is projected to rise by a factor of 24.4. Mitigating the environmental impact of Generative AI by 2030 requires coordinated efforts across the AI value chain. Isolated measures in hardware efficiency, model efficiency, or grid improvements alone are insufficient. We advocate for standardized environmental assessment frameworks, greater transparency from the all actors of the value chain and the introduction of a "Return on Environment" metric to align AI development with net-zero goals.

  • 6 authors
·
Jan 24, 2025 3

Early warning signals: The charted and uncharted territories

The realization that complex systems such as ecological communities can collapse or shift regimes suddenly and without rapid external forcing poses a serious challenge to our understanding and management of the natural world. The potential to identify early warning signals that would allow researchers and managers to predict such events before they happen has therefore been an invaluable discovery that offers a way forward in spite of such seemingly unpredictable behavior. Research into early warning signals has demonstrated that it is possible to define and detect such early warning signals in advance of a transition in certain contexts. Here we describe the pattern emerging as research continues to explore just how far we can generalize these results. A core of examples emerges that shares three properties: the phenomenon of rapid regime shifts, a pattern of 'critical slowing down' that can be used to detect the approaching shift, and a mechanism of bifurcation driving the sudden change. As research has expanded beyond these core examples, it is becoming clear that not all systems that show regime shifts exhibit critical slowing down, or vice versa. Even when systems exhibit critical slowing down, statistical detection is a challenge. We review the literature that explores these edge cases and highlight the need for (a) new early warning behaviors that can be used in cases where rapid shifts do not exhibit critical slowing down, (b) the development of methods to identify which behavior might be an appropriate signal when encountering a novel system; bearing in mind that a positive indication for some systems is a negative indication in others, and (c) statistical methods that can distinguish between signatures of early warning behaviors and noise.

  • 3 authors
·
May 29, 2013

WebArena: A Realistic Web Environment for Building Autonomous Agents

With generative AI advances, the exciting potential for autonomous agents to manage daily tasks via natural language commands has emerged. However, cur rent agents are primarily created and tested in simplified synthetic environments, substantially limiting real-world scenario representation. In this paper, we build an environment for agent command and control that is highly realistic and reproducible. Specifically, we focus on agents that perform tasks on websites, and we create an environment with fully functional websites from four common domains: e-commerce, social forum discussions, collaborative software development, and content management. Our environment is enriched with tools (e.g., a map) and external knowledge bases (e.g., user manuals) to encourage human-like task-solving. Building upon our environment, we release a set of benchmark tasks focusing on evaluating the functional correctness of task completions. The tasks in our benchmark are diverse, long-horizon, and are designed to emulate tasks that humans routinely perform on the internet. We design and implement several autonomous agents, integrating recent techniques such as reasoning before acting. The results demonstrate that solving complex tasks is challenging: our best GPT-4-based agent only achieves an end-to-end task success rate of 10.59%. These results highlight the need for further development of robust agents, that current state-of-the-art LMs are far from perfect performance in these real-life tasks, and that WebArena can be used to measure such progress. Our code, data, environment reproduction resources, and video demonstrations are publicly available at https://webarena.dev/.

  • 11 authors
·
Jul 25, 2023 4

Reporting and Analysing the Environmental Impact of Language Models on the Example of Commonsense Question Answering with External Knowledge

Human-produced emissions are growing at an alarming rate, causing already observable changes in the climate and environment in general. Each year global carbon dioxide emissions hit a new record, and it is reported that 0.5% of total US greenhouse gas emissions are attributed to data centres as of 2021. The release of ChatGPT in late 2022 sparked social interest in Large Language Models (LLMs), the new generation of Language Models with a large number of parameters and trained on massive amounts of data. Currently, numerous companies are releasing products featuring various LLMs, with many more models in development and awaiting release. Deep Learning research is a competitive field, with only models that reach top performance attracting attention and being utilized. Hence, achieving better accuracy and results is often the first priority, while the model's efficiency and the environmental impact of the study are neglected. However, LLMs demand substantial computational resources and are very costly to train, both financially and environmentally. It becomes essential to raise awareness and promote conscious decisions about algorithmic and hardware choices. Providing information on training time, the approximate carbon dioxide emissions and power consumption would assist future studies in making necessary adjustments and determining the compatibility of available computational resources with model requirements. In this study, we infused T5 LLM with external knowledge and fine-tuned the model for Question-Answering task. Furthermore, we calculated and reported the approximate environmental impact for both steps. The findings demonstrate that the smaller models may not always be sustainable options, and increased training does not always imply better performance. The most optimal outcome is achieved by carefully considering both performance and efficiency factors.

  • 4 authors
·
Jul 24, 2024

The Influence of Neural Networks on Hydropower Plant Management in Agriculture: Addressing Challenges and Exploring Untapped Opportunities

Hydropower plants are crucial for stable renewable energy and serve as vital water sources for sustainable agriculture. However, it is essential to assess the current water management practices associated with hydropower plant management software. A key concern is the potential conflict between electricity generation and agricultural water needs. Prioritising water for electricity generation can reduce irrigation availability in agriculture during crucial periods like droughts, impacting crop yields and regional food security. Coordination between electricity and agricultural water allocation is necessary to ensure optimal and environmentally sound practices. Neural networks have become valuable tools for hydropower plant management, but their black-box nature raises concerns about transparency in decision making. Additionally, current approaches often do not take advantage of their potential to create a system that effectively balances water allocation. This work is a call for attention and highlights the potential risks of deploying neural network-based hydropower plant management software without proper scrutiny and control. To address these concerns, we propose the adoption of the Agriculture Conscious Hydropower Plant Management framework, aiming to maximise electricity production while prioritising stable irrigation for agriculture. We also advocate reevaluating government-imposed minimum water guidelines for irrigation to ensure flexibility and effective water allocation. Additionally, we suggest a set of regulatory measures to promote model transparency and robustness, certifying software that makes conscious and intelligent water allocation decisions, ultimately safeguarding agriculture from undue strain during droughts.

  • 3 authors
·
Nov 21, 2023

MASSW: A New Dataset and Benchmark Tasks for AI-Assisted Scientific Workflows

Scientific innovation relies on detailed workflows, which include critical steps such as analyzing literature, generating ideas, validating these ideas, interpreting results, and inspiring follow-up research. However, scientific publications that document these workflows are extensive and unstructured. This makes it difficult for both human researchers and AI systems to effectively navigate and explore the space of scientific innovation. To address this issue, we introduce MASSW, a comprehensive text dataset on Multi-Aspect Summarization of Scientific Workflows. MASSW includes more than 152,000 peer-reviewed publications from 17 leading computer science conferences spanning the past 50 years. Using Large Language Models (LLMs), we automatically extract five core aspects from these publications -- context, key idea, method, outcome, and projected impact -- which correspond to five key steps in the research workflow. These structured summaries facilitate a variety of downstream tasks and analyses. The quality of the LLM-extracted summaries is validated by comparing them with human annotations. We demonstrate the utility of MASSW through multiple novel machine-learning tasks that can be benchmarked using this new dataset, which make various types of predictions and recommendations along the scientific workflow. MASSW holds significant potential for researchers to create and benchmark new AI methods for optimizing scientific workflows and fostering scientific innovation in the field. Our dataset is openly available at https://github.com/xingjian-zhang/massw.

  • 11 authors
·
Jun 10, 2024

Rethinking Supervised Pre-training for Better Downstream Transferring

The pretrain-finetune paradigm has shown outstanding performance on many applications of deep learning, where a model is pre-trained on a upstream large dataset (e.g. ImageNet), and is then fine-tuned to different downstream tasks. Though for most cases, the pre-training stage is conducted based on supervised methods, recent works on self-supervised pre-training have shown powerful transferability and even outperform supervised pre-training on multiple downstream tasks. It thus remains an open question how to better generalize supervised pre-training model to downstream tasks. In this paper, we argue that the worse transferability of existing supervised pre-training methods arise from the negligence of valuable intra-class semantic difference. This is because these methods tend to push images from the same class close to each other despite of the large diversity in their visual contents, a problem to which referred as "overfit of upstream tasks". To alleviate this problem, we propose a new supervised pre-training method based on Leave-One-Out K-Nearest-Neighbor, or LOOK for short. It relieves the problem of overfitting upstream tasks by only requiring each image to share its class label with most of its k nearest neighbors, thus allowing each class to exhibit a multi-mode distribution and consequentially preserving part of intra-class difference for better transferring to downstream tasks. We developed efficient implementation of the proposed method that scales well to large datasets. Experimental studies on multiple downstream tasks show that LOOK outperforms other state-of-the-art methods for supervised and self-supervised pre-training.

  • 5 authors
·
Oct 12, 2021

Bridging the Gap: Integrating Ethics and Environmental Sustainability in AI Research and Practice

As the possibilities for Artificial Intelligence (AI) have grown, so have concerns regarding its impacts on society and the environment. However, these issues are often raised separately; i.e. carbon footprint analyses of AI models typically do not consider how the pursuit of scale has contributed towards building models that are both inaccessible to most researchers in terms of cost and disproportionately harmful to the environment. On the other hand, model audits that aim to evaluate model performance and disparate impacts mostly fail to engage with the environmental ramifications of AI models and how these fit into their auditing approaches. In this separation, both research directions fail to capture the depth of analysis that can be explored by considering the two in parallel and the potential solutions for making informed choices that can be developed at their convergence. In this essay, we build upon work carried out in AI and in sister communities, such as philosophy and sustainable development, to make more deliberate connections around topics such as generalizability, transparency, evaluation and equity across AI research and practice. We argue that the efforts aiming to study AI's ethical ramifications should be made in tandem with those evaluating its impacts on the environment, and we conclude with a proposal of best practices to better integrate AI ethics and sustainability in AI research and practice.

  • 4 authors
·
Apr 1, 2025

Green Algorithms: Quantifying the carbon footprint of computation

Climate change is profoundly affecting nearly all aspects of life on earth, including human societies, economies and health. Various human activities are responsible for significant greenhouse gas emissions, including data centres and other sources of large-scale computation. Although many important scientific milestones have been achieved thanks to the development of high-performance computing, the resultant environmental impact has been underappreciated. In this paper, we present a methodological framework to estimate the carbon footprint of any computational task in a standardised and reliable way, based on the processing time, type of computing cores, memory available and the efficiency and location of the computing facility. Metrics to interpret and contextualise greenhouse gas emissions are defined, including the equivalent distance travelled by car or plane as well as the number of tree-months necessary for carbon sequestration. We develop a freely available online tool, Green Algorithms (www.green-algorithms.org), which enables a user to estimate and report the carbon footprint of their computation. The Green Algorithms tool easily integrates with computational processes as it requires minimal information and does not interfere with existing code, while also accounting for a broad range of CPUs, GPUs, cloud computing, local servers and desktop computers. Finally, by applying Green Algorithms, we quantify the greenhouse gas emissions of algorithms used for particle physics simulations, weather forecasts and natural language processing. Taken together, this study develops a simple generalisable framework and freely available tool to quantify the carbon footprint of nearly any computation. Combined with a series of recommendations to minimise unnecessary CO2 emissions, we hope to raise awareness and facilitate greener computation.

  • 3 authors
·
Jul 15, 2020

More than Carbon: Cradle-to-Grave environmental impacts of GenAI training on the Nvidia A100 GPU

The rapid expansion of AI has intensified concerns about its environmental sustainability. Yet, current assessments predominantly focus on operational carbon emissions using secondary data or estimated values, overlooking environmental impacts in other life cycle stages. This study presents the first comprehensive multi-criteria life cycle assessment (LCA) of AI training, examining 16 environmental impact categories based on detailed primary data collection of the Nvidia A100 SXM 40GB GPU. The LCA results for training BLOOM reveal that the use phase dominates 11 of 16 impact categories including climate change (96\%), while manufacturing dominates the remaining 5 impact categories including human toxicity, cancer (99\%) and mineral and metal depletion (85\%). For training GPT-4, the use phase dominates 10 of 16 impact categories, contributing about 96\% to both the climate change and resource use, fossils category. The manufacturing stage dominates 6 of 16 impact categories including human toxicity, cancer (94\%) and eutrophication, freshwater (81\%). Assessing the cradle-to-gate environmental impact distribution across the GPU components reveals that the GPU chip is the largest contributor across 10 of 16 of impact categories and shows particularly pronounced contributions to climate change (81\%) and resource use, fossils (80\%). While primary data collection results in modest changes in carbon estimates compared to database-derived estimates, substantial variations emerge in other categories. Most notably, minerals and metals depletion increases by 33\%, demonstrating the critical importance of primary data for non-carbon accounting. This multi-criteria analysis expands the Sustainable AI discourse beyond operational carbon emissions, challenging current sustainability narratives and highlighting the need for policy frameworks addressing the full spectrum of AI's environmental impact.

  • 8 authors
·
Aug 27, 2025

Unveiling Downstream Performance Scaling of LLMs: A Clustering-Based Perspective

The rapid advancements in computing dramatically increase the scale and cost of training Large Language Models (LLMs). Accurately predicting downstream task performance prior to model training is crucial for efficient resource allocation, yet remains challenging due to two primary constraints: (1) the "emergence phenomenon", wherein downstream performance metrics become meaningful only after extensive training, which limits the ability to use smaller models for prediction; (2) Uneven task difficulty distributions and the absence of consistent scaling laws, resulting in substantial metric variability. Existing performance prediction methods suffer from limited accuracy and reliability, thereby impeding the assessment of potential LLM capabilities. To address these challenges, we propose a Clustering-On-Difficulty (COD) downstream performance prediction framework. COD first constructs a predictable support subset by clustering tasks based on difficulty features, strategically excluding non-emergent and non-scalable clusters. The scores on the selected subset serve as effective intermediate predictors of downstream performance on the full evaluation set. With theoretical support, we derive a mapping function that transforms performance metrics from the predictable subset to the full evaluation set, thereby ensuring accurate extrapolation of LLM downstream performance. The proposed method has been applied to predict performance scaling for a 70B LLM, providing actionable insights for training resource allocation and assisting in monitoring the training process. Notably, COD achieves remarkable predictive accuracy on the 70B LLM by leveraging an ensemble of small models, demonstrating an absolute mean deviation of 1.36% across eight important LLM evaluation benchmarks.

  • 5 authors
·
Feb 24, 2025 2

An Efficient General-Purpose Modular Vision Model via Multi-Task Heterogeneous Training

We present a model that can perform multiple vision tasks and can be adapted to other downstream tasks efficiently. Despite considerable progress in multi-task learning, most efforts focus on learning from multi-label data: a single image set with multiple task labels. Such multi-label data sets are rare, small, and expensive. We say heterogeneous to refer to image sets with different task labels, or to combinations of single-task datasets. Few have explored training on such heterogeneous datasets. General-purpose vision models are still dominated by single-task pretraining, and it remains unclear how to scale up multi-task models by leveraging mainstream vision datasets designed for different purposes. The challenges lie in managing large intrinsic differences among vision tasks, including data distribution, architectures, task-specific modules, dataset scales, and sampling strategies. To address these challenges, we propose to modify and scale up mixture-of-experts (MoE) vision transformers, so that they can simultaneously learn classification, detection, and segmentation on diverse mainstream vision datasets including ImageNet, COCO, and ADE20K. Our approach achieves comparable results to single-task state-of-the-art models and demonstrates strong generalization on downstream tasks. Due to its emergent modularity, this general-purpose model decomposes into high-performing components, efficiently adapting to downstream tasks. We can fine-tune it with fewer training parameters, fewer model parameters, and less computation. Additionally, its modularity allows for easy expansion in continual-learning-without-forgetting scenarios. Finally, these functions can be controlled and combined to meet various demands of downstream tasks.

  • 7 authors
·
Jun 29, 2023

When Does Metadata Conditioning (NOT) Work for Language Model Pre-Training? A Study with Context-Free Grammars

The ability to acquire latent semantics is one of the key properties that determines the performance of language models. One convenient approach to invoke this ability is to prepend metadata (e.g. URLs, domains, and styles) at the beginning of texts in the pre-training data, making it easier for the model to access latent semantics before observing the entire text. Previous studies have reported that this technique actually improves the performance of trained models in downstream tasks; however, this improvement has been observed only in specific downstream tasks, without consistent enhancement in average next-token prediction loss. To understand this phenomenon, we closely investigate how prepending metadata during pre-training affects model performance by examining its behavior using artificial data. Interestingly, we found that this approach produces both positive and negative effects on the downstream tasks. We demonstrate that the effectiveness of the approach depends on whether latent semantics can be inferred from the downstream task's prompt. Specifically, through investigations using data generated by probabilistic context-free grammars, we show that training with metadata helps improve model's performance when the given context is long enough to infer the latent semantics. In contrast, the technique negatively impacts performance when the context lacks the necessary information to make an accurate posterior inference.

  • 10 authors
·
Apr 24, 2025

GeoPlant: Spatial Plant Species Prediction Dataset

The difficulty of monitoring biodiversity at fine scales and over large areas limits ecological knowledge and conservation efforts. To fill this gap, Species Distribution Models (SDMs) predict species across space from spatially explicit features. Yet, they face the challenge of integrating the rich but heterogeneous data made available over the past decade, notably millions of opportunistic species observations and standardized surveys, as well as multi-modal remote sensing data. In light of that, we have designed and developed a new European-scale dataset for SDMs at high spatial resolution (10-50 m), including more than 10k species (i.e., most of the European flora). The dataset comprises 5M heterogeneous Presence-Only records and 90k exhaustive Presence-Absence survey records, all accompanied by diverse environmental rasters (e.g., elevation, human footprint, and soil) that are traditionally used in SDMs. In addition, it provides Sentinel-2 RGB and NIR satellite images with 10 m resolution, a 20-year time-series of climatic variables, and satellite time-series from the Landsat program. In addition to the data, we provide an openly accessible SDM benchmark (hosted on Kaggle), which has already attracted an active community and a set of strong baselines for single predictor/modality and multimodal approaches. All resources, e.g., the dataset, pre-trained models, and baseline methods (in the form of notebooks), are available on Kaggle, allowing one to start with our dataset literally with two mouse clicks.

  • 10 authors
·
Aug 25, 2024

METER-ML: A Multi-Sensor Earth Observation Benchmark for Automated Methane Source Mapping

Reducing methane emissions is essential for mitigating global warming. To attribute methane emissions to their sources, a comprehensive dataset of methane source infrastructure is necessary. Recent advancements with deep learning on remotely sensed imagery have the potential to identify the locations and characteristics of methane sources, but there is a substantial lack of publicly available data to enable machine learning researchers and practitioners to build automated mapping approaches. To help fill this gap, we construct a multi-sensor dataset called METER-ML containing 86,599 georeferenced NAIP, Sentinel-1, and Sentinel-2 images in the U.S. labeled for the presence or absence of methane source facilities including concentrated animal feeding operations, coal mines, landfills, natural gas processing plants, oil refineries and petroleum terminals, and wastewater treatment plants. We experiment with a variety of models that leverage different spatial resolutions, spatial footprints, image products, and spectral bands. We find that our best model achieves an area under the precision recall curve of 0.915 for identifying concentrated animal feeding operations and 0.821 for oil refineries and petroleum terminals on an expert-labeled test set, suggesting the potential for large-scale mapping. We make METER-ML freely available at https://stanfordmlgroup.github.io/projects/meter-ml/ to support future work on automated methane source mapping.

  • 10 authors
·
Jul 22, 2022

Gravity-Informed Deep Learning Framework for Predicting Ship Traffic Flow and Invasion Risk of Non-Indigenous Species via Ballast Water Discharge

Invasive species in water bodies pose a major threat to the environment and biodiversity globally. Due to increased transportation and trade, non-native species have been introduced to new environments, causing damage to ecosystems and leading to economic losses in agriculture, forestry, and fisheries. Therefore, there is a pressing need for risk assessment and management techniques to mitigate the impact of these invasions. This study aims to develop a new physics-inspired model to forecast maritime shipping traffic and thus inform risk assessment of invasive species spread through global transportation networks. Inspired by the gravity model for international trades, our model considers various factors that influence the likelihood and impact of vessel activities, such as shipping flux density, distance between ports, trade flow, and centrality measures of transportation hubs. Additionally, by analyzing the risk network of invasive species, we provide a comprehensive framework for assessing the invasion threat level given a pair of origin and destination. Accordingly, this paper introduces transformers to gravity models to rebuild the short- and long-term dependencies that make the risk analysis feasible. Thus, we introduce a physics-inspired framework that achieves an 89% segmentation accuracy for existing and non-existing trajectories and an 84.8% accuracy for the number of vessels flowing between key port areas, representing more than 10% improvement over the traditional deep-gravity model. Along these lines, this research contributes to a better understanding of invasive species risk assessment. It allows policymakers, conservationists, and stakeholders to prioritize management actions by identifying high-risk invasion pathways. Besides, our model is versatile and can include new data sources, making it suitable for assessing species invasion risks in a changing global landscape.

  • 6 authors
·
Jan 23, 2024

Air Quality and Greenhouse Gas Emissions Assessment of Data Centers in Texas: Quantifying Impacts and Environmental Tradeoffs

This study assesses air quality (AQ) and greenhouse gas (GHG) emissions from the rapid expansion of data centers in Texas, a major hub due to infrastructure, electricity markets, and business conditions. AQ impacts were separated from GHG emissions to clarify sources, regulations, and mitigation strategies. Electricity consumption and cooling systems dominate GHG emissions, with a 10 megawatt data center generating about 37,668 metric tons CO2 annually, while construction materials and IT equipment add substantial embodied emissions. Local AQ impacts, often overlooked, arise from diesel backup generators, construction equipment, and commuting. Generator testing alone can emit about 12 metric tons of NOx annually per facility, worsening ozone issues in regions such as Houston and Dallas-Fort Worth. Mitigation strategies include advanced cooling, renewable energy procurement, cleaner backup power (fuel cells, batteries), sustainable construction, and standardized reporting. ERCOT forecasts project 39 to 78 gigawatts of new data center load by 2030, potentially leading to 170 to 205 million metric tons of annual CO2 emissions. Aggressive adoption of renewables and advanced technologies could cut emissions by 50 to 80 percent, avoiding 85 to 165 million metric tons of CO2. The study identifies research and policy gaps, including the need for cumulative air dispersion modeling, AQ-specific regulations, and mandatory efficiency standards. Findings underscore the importance of aligning Texas digital infrastructure growth with environmental and community health protections.

  • 1 authors
·
Sep 25, 2025

MUSCLE: A Model Update Strategy for Compatible LLM Evolution

Large Language Models (LLMs) are frequently updated due to data or architecture changes to improve their performance. When updating models, developers often focus on increasing overall performance metrics with less emphasis on being compatible with previous model versions. However, users often build a mental model of the functionality and capabilities of a particular machine learning model they are interacting with. They have to adapt their mental model with every update -- a draining task that can lead to user dissatisfaction. In practice, fine-tuned downstream task adapters rely on pretrained LLM base models. When these base models are updated, these user-facing downstream task models experience instance regression or negative flips -- previously correct instances are now predicted incorrectly. This happens even when the downstream task training procedures remain identical. Our work aims to provide seamless model updates to a user in two ways. First, we provide evaluation metrics for a notion of compatibility to prior model versions, specifically for generative tasks but also applicable for discriminative tasks. We observe regression and inconsistencies between different model versions on a diverse set of tasks and model updates. Second, we propose a training strategy to minimize the number of inconsistencies in model updates, involving training of a compatibility model that can enhance task fine-tuned language models. We reduce negative flips -- instances where a prior model version was correct, but a new model incorrect -- by up to 40% from Llama 1 to Llama 2.

  • 7 authors
·
Jul 12, 2024 2

SustainBench: Benchmarks for Monitoring the Sustainable Development Goals with Machine Learning

Progress toward the United Nations Sustainable Development Goals (SDGs) has been hindered by a lack of data on key environmental and socioeconomic indicators, which historically have come from ground surveys with sparse temporal and spatial coverage. Recent advances in machine learning have made it possible to utilize abundant, frequently-updated, and globally available data, such as from satellites or social media, to provide insights into progress toward SDGs. Despite promising early results, approaches to using such data for SDG measurement thus far have largely evaluated on different datasets or used inconsistent evaluation metrics, making it hard to understand whether performance is improving and where additional research would be most fruitful. Furthermore, processing satellite and ground survey data requires domain knowledge that many in the machine learning community lack. In this paper, we introduce SustainBench, a collection of 15 benchmark tasks across 7 SDGs, including tasks related to economic development, agriculture, health, education, water and sanitation, climate action, and life on land. Datasets for 11 of the 15 tasks are released publicly for the first time. Our goals for SustainBench are to (1) lower the barriers to entry for the machine learning community to contribute to measuring and achieving the SDGs; (2) provide standard benchmarks for evaluating machine learning models on tasks across a variety of SDGs; and (3) encourage the development of novel machine learning methods where improved model performance facilitates progress towards the SDGs.

  • 10 authors
·
Nov 8, 2021

What Matters for Bioacoustic Encoding

Bioacoustics, the study of sounds produced by living organisms, plays a vital role in conservation, biodiversity monitoring, and behavioral studies. Many tasks in this field, such as species, individual, and behavior classification and detection, are well-suited to machine learning. However, they often suffer from limited annotated data, highlighting the need for a general-purpose bioacoustic encoder capable of extracting useful representations for diverse downstream tasks. Such encoders have been proposed before, but are often limited in scope due to a focus on a narrow range of species (typically birds), and a reliance on a single model architecture or training paradigm. Moreover, they are usually evaluated on a small set of tasks and datasets. In this work, we present a large-scale empirical study that covers aspects of bioacoustics that are relevant to research but have previously been scarcely considered: training data diversity and scale, model architectures and training recipes, and the breadth of evaluation tasks and datasets. We obtain encoders that are state-of-the-art on the existing and proposed benchmarks. We also identify what matters for training these encoders, such that this work can be extended when more data are available or better architectures are proposed. Specifically, across 26 datasets with tasks including species classification, detection, individual ID, and vocal repertoire discovery, we find self-supervised pre-training followed by supervised post-training on a mixed bioacoustics + general-audio corpus yields the strongest in- and out-of-distribution performance. We show the importance of data diversity in both stages. To support ongoing research and application, we will release the model checkpoints.

  • 17 authors
·
Aug 15, 2025

GreenHyperSpectra: A multi-source hyperspectral dataset for global vegetation trait prediction

Plant traits such as leaf carbon content and leaf mass are essential variables in the study of biodiversity and climate change. However, conventional field sampling cannot feasibly cover trait variation at ecologically meaningful spatial scales. Machine learning represents a valuable solution for plant trait prediction across ecosystems, leveraging hyperspectral data from remote sensing. Nevertheless, trait prediction from hyperspectral data is challenged by label scarcity and substantial domain shifts (\eg across sensors, ecological distributions), requiring robust cross-domain methods. Here, we present GreenHyperSpectra, a pretraining dataset encompassing real-world cross-sensor and cross-ecosystem samples designed to benchmark trait prediction with semi- and self-supervised methods. We adopt an evaluation framework encompassing in-distribution and out-of-distribution scenarios. We successfully leverage GreenHyperSpectra to pretrain label-efficient multi-output regression models that outperform the state-of-the-art supervised baseline. Our empirical analyses demonstrate substantial improvements in learning spectral representations for trait prediction, establishing a comprehensive methodological framework to catalyze research at the intersection of representation learning and plant functional traits assessment. All code and data are available at: https://github.com/echerif18/HyspectraSSL.

  • 10 authors
·
Jul 9, 2025

Large Content And Behavior Models To Understand, Simulate, And Optimize Content And Behavior

Shannon, in his seminal paper introducing information theory, divided the communication into three levels: technical, semantic, and effectivenss. While the technical level is concerned with accurate reconstruction of transmitted symbols, the semantic and effectiveness levels deal with the inferred meaning and its effect on the receiver. Thanks to telecommunications, the first level problem has produced great advances like the internet. Large Language Models (LLMs) make some progress towards the second goal, but the third level still remains largely untouched. The third problem deals with predicting and optimizing communication for desired receiver behavior. LLMs, while showing wide generalization capabilities across a wide range of tasks, are unable to solve for this. One reason for the underperformance could be a lack of "behavior tokens" in LLMs' training corpora. Behavior tokens define receiver behavior over a communication, such as shares, likes, clicks, purchases, retweets, etc. While preprocessing data for LLM training, behavior tokens are often removed from the corpora as noise. Therefore, in this paper, we make some initial progress towards reintroducing behavior tokens in LLM training. The trained models, other than showing similar performance to LLMs on content understanding tasks, show generalization capabilities on behavior simulation, content simulation, behavior understanding, and behavior domain adaptation. Using a wide range of tasks on two corpora, we show results on all these capabilities. We call these models Large Content and Behavior Models (LCBMs). Further, to spur more research on LCBMs, we release our new Content Behavior Corpus (CBC), a repository containing communicator, message, and corresponding receiver behavior.

  • 11 authors
·
Sep 1, 2023

Exploring Public Attention in the Circular Economy through Topic Modelling with Twin Hyperparameter Optimisation

To advance the circular economy (CE), it is crucial to gain insights into the evolution of public attention, cognitive pathways of the masses concerning circular products, and to identify primary concerns. To achieve this, we collected data from diverse platforms, including Twitter, Reddit, and The Guardian, and utilised three topic models to analyse the data. Given the performance of topic modelling may vary depending on hyperparameter settings, this research proposed a novel framework that integrates twin (single and multi-objective) hyperparameter optimisation for the CE. We conducted systematic experiments to ensure that topic models are set with appropriate hyperparameters under different constraints, providing valuable insights into the correlations between CE and public attention. In summary, our optimised model reveals that public remains concerned about the economic impacts of sustainability and circular practices, particularly regarding recyclable materials and environmentally sustainable technologies. The analysis shows that the CE has attracted significant attention on The Guardian, especially in topics related to sustainable development and environmental protection technologies, while discussions are comparatively less active on Twitter. These insights highlight the need for policymakers to implement targeted education programs, create incentives for businesses to adopt CE principles, and enforce more stringent waste management policies alongside improved recycling processes.

  • 6 authors
·
May 16, 2024