Daily Papers

by AK and the research community

Apr 14

An open-source robust machine learning platform for real-time detection and classification of 2D material flakes

The most widely used method for obtaining high-quality two-dimensional materials is mechanical exfoliation of bulk crystals. Manual identification of suitable flakes from the resulting random distribution of crystal thicknesses and sizes on a substrate is a time-consuming, tedious task. Here, we present a platform for fully automated scanning, detection, and classification of two-dimensional materials, the source code of which we make openly available. Our platform is designed to be accurate, reliable, fast, and versatile in integrating new materials, making it suitable for everyday laboratory work. The implementation allows fully automated scanning and analysis of wafers with an average inference time of 100 ms for images of 2.3 Mpixels. The developed detection algorithm is based on a combination of the flakes' optical contrast relative to the substrate and their geometric shape. We demonstrate that it is able to detect the majority of exfoliated flakes of various materials, with an average recall (AR50) between 67% and 89%. We also show that the algorithm can be trained with as few as five flakes of a given material, which we demonstrate for few-layer graphene, WSe_2, MoSe_2, CrI_3, 1T-TaS_2 and hexagonal BN. Our platform has been tested over a two-year period, during which more than 10^6 images of multiple different materials were acquired by over 30 individual researchers.

  • 11 authors
·
Jun 26, 2023
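
The abstract above describes detection as a combination of optical contrast relative to the substrate and geometric shape. The following is a minimal sketch of that general idea, not the platform's actual code: the contrast formula `(I_pixel - I_substrate) / I_substrate`, the tolerance `tol`, the area cutoff `min_area`, and every function name here are assumptions chosen for illustration.

```python
from collections import deque

def optical_contrast(pixel, substrate):
    """Per-channel contrast (I_pixel - I_substrate) / I_substrate (assumed definition)."""
    return tuple((p - s) / s for p, s in zip(pixel, substrate))

def contrast_mask(image, substrate, target, tol):
    """Mark pixels whose contrast is within `tol` of `target` in every channel."""
    return [
        [
            all(abs(c - t) <= tol
                for c, t in zip(optical_contrast(px, substrate), target))
            for px in row
        ]
        for row in image
    ]

def connected_regions(mask):
    """Return 4-connected regions of True pixels as lists of (row, col)."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    regions = []
    for r in range(h):
        for c in range(w):
            if not mask[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue, region = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append(region)
    return regions

def detect_flakes(image, substrate, target, tol=0.05, min_area=4):
    """Contrast step followed by a crude geometric filter (minimum area)."""
    mask = contrast_mask(image, substrate, target, tol)
    return [reg for reg in connected_regions(mask) if len(reg) >= min_area]

# Toy 5x5 RGB image: uniform substrate, one 2x2 flake, one stray matching pixel.
SUBSTRATE = (100, 100, 100)
FLAKE = (90, 95, 100)  # contrast (-0.1, -0.05, 0.0) against SUBSTRATE
image = [[SUBSTRATE for _ in range(5)] for _ in range(5)]
for y, x in ((1, 1), (1, 2), (2, 1), (2, 2), (4, 4)):
    image[y][x] = FLAKE

flakes = detect_flakes(image, SUBSTRATE, target=(-0.1, -0.05, 0.0))
print(len(flakes), [len(f) for f in flakes])
```

In this toy setup the stray pixel at (4, 4) matches the target contrast but is rejected by the area filter, which is the role a shape criterion plays here. A real pipeline would presumably estimate the substrate color per image and use richer shape descriptors than area alone.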

A Family of LLMs Liberated from Static Vocabularies

Tokenization is a central component of natural language processing in current large language models (LLMs), enabling models to convert raw text into processable units. Although learned tokenizers are widely adopted, they exhibit notable limitations, including their large, fixed vocabulary sizes and poor adaptability to new domains or languages. We present a family of models with up to 70 billion parameters based on the hierarchical autoregressive transformer (HAT) architecture. In HAT, an encoder transformer aggregates bytes into word embeddings and then feeds them to the backbone, a classical autoregressive transformer. The outputs of the backbone are then cross-attended by the decoder and converted back into bytes. We show that we can reuse available pre-trained models by converting the Llama 3.1 8B and 70B models into the HAT architecture: Llama-3.1-8B-TFree-HAT and Llama-3.1-70B-TFree-HAT are byte-level models whose encoder and decoder are trained from scratch, but where we adapt the pre-trained Llama backbone, i.e., the transformer blocks with the embedding matrix and head removed, to handle word embeddings instead of the original tokens. We also provide a 7B HAT model, Llama-TFree-HAT-Pretrained, trained entirely from scratch on nearly 4 trillion words. The HAT architecture improves text compression by reducing the number of required sequence positions and enhances robustness to intra-word variations, e.g., spelling differences. Through pre-training, as well as subsequent supervised fine-tuning and direct preference optimization in English and German, we show strong proficiency in both languages, improving on the original Llama 3.1 in most benchmarks. We release our models (including 200 pre-training checkpoints) on Hugging Face.

  • 37 authors
·
Mar 16
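
The abstract above says the HAT encoder aggregates bytes into word embeddings, which reduces the number of sequence positions the backbone has to process. The sketch below illustrates only that position-count effect, under loud assumptions: the whitespace-based splitting rule and the helper names `split_into_words`, `to_byte_groups`, and `from_byte_groups` are hypothetical and are not taken from the released models.

```python
import re

def split_into_words(text):
    """Split text into chunks: a word plus its trailing whitespace (assumed rule)."""
    return re.findall(r"\S+\s*|\s+", text)

def to_byte_groups(text):
    """One list of UTF-8 byte values per word; an encoder would pool each group
    into a single word embedding, so the backbone sees one position per group."""
    return [list(word.encode("utf-8")) for word in split_into_words(text)]

def from_byte_groups(groups):
    """Inverse mapping: concatenate the byte groups and decode back to text."""
    return b"".join(bytes(g) for g in groups).decode("utf-8")

text = "Tokenizer-free models read bytes"
groups = to_byte_groups(text)

byte_positions = sum(len(g) for g in groups)  # positions at the byte level
word_positions = len(groups)                  # positions seen by the backbone

print(byte_positions, word_positions)
print(from_byte_groups(groups) == text)
```

Because the grouping is lossless at the byte level, no fixed vocabulary is needed: any UTF-8 string, including unseen words or spelling variants, maps to byte groups and back.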