---
license: eupl-1.2
language: code
base_model:
- NetherlandsForensicInstitute/ARM64BERT
library_name: sentence-transformers
---

ARM64BERT-embedding 🦾
======================


[GitHub repository](https://github.com/NetherlandsForensicInstitute/asmtransformers)


## General
### What is the purpose of the model?
The model is a BERT model of ARM64 assembly code that can be used to find ARM64 functions that are similar to a given function.
This task is known as _binary code similarity detection_, which is analogous to the _sentence similarity_ task in natural language processing.


### What does the model architecture look like?
The model architecture is inspired by [jTrans](https://github.com/vul337/jTrans) (Wang et al., 2022).
It is a BERT model (Devlin et al., 2019), although the typical Next Sentence Prediction task has been replaced with Jump Target Prediction, as proposed by Wang et al.
This architecture has subsequently been finetuned for semantic search, following the procedure proposed by [S-BERT](https://www.sbert.net/examples/applications/semantic-search/README.html).


### What is the output of the model?
The model returns a 768-dimensional embedding vector for each function it is given. These embeddings can be compared to
get an indication of which functions are similar to each other.
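Similarity between two embeddings is usually measured with cosine similarity. A minimal sketch in plain Python with stand-in random vectors (in practice both vectors would come from encoding ARM64 functions with this model via the `sentence-transformers` library; the vectors here are purely illustrative):

```python
import random
from math import sqrt

def cos_sim(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|), ranging from -1 to 1."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Stand-ins for two 768-dimensional function embeddings (random, for
# illustration; real embeddings would come from model.encode(...)).
random.seed(0)
emb_a = [random.gauss(0, 1) for _ in range(768)]
emb_b = [random.gauss(0, 1) for _ in range(768)]

print(cos_sim(emb_a, emb_a))  # a vector compared with itself gives ~1.0
print(cos_sim(emb_a, emb_b))  # unrelated random vectors score near 0.0
```

The higher the cosine similarity, the more likely the two functions implement the same behaviour.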


### How does the model perform?
The model has been evaluated on [Mean Reciprocal Rank (MRR)](https://en.wikipedia.org/wiki/Mean_reciprocal_rank) and
[Recall@1](https://en.wikipedia.org/wiki/Precision_and_recall).
When the model has to pick the positive example out of a pool of 32 functions, it ranks the positive example highest most of the time.
When the pool is enlarged to 10,000 functions, it still ranks the positive example first or second in most cases.


| | Model | Pool size | MRR | Recall@1 | |
| |----------------------|-----------|------|----------| |
| | ARM64BERT | 32 | 0.78 | 0.72 | |
| | ARM64BERT-embedding | 32 | 0.99 | 0.99 | |
| ARM64BERT            | 10,000    | 0.58 | 0.56     |
| ARM64BERT-embedding  | 10,000    | 0.87 | 0.83     |
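Both metrics are computed from the rank that the positive example receives among the candidates for each query. A small illustrative sketch with hypothetical ranks (not the actual evaluation data):

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank of the positive example over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def recall_at_1(ranks):
    """Fraction of queries where the positive example is ranked first."""
    return sum(1 for r in ranks if r == 1) / len(ranks)

# Hypothetical ranks of the positive example for five queries:
ranks = [1, 1, 2, 1, 4]
print(mean_reciprocal_rank(ranks))  # (1 + 1 + 0.5 + 1 + 0.25) / 5 = 0.75
print(recall_at_1(ranks))           # 3 of 5 queries ranked first = 0.6
```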


## Purpose and use of the model


### For which problem has the model been designed?
The model has been designed to find similar ARM64 functions in a database of known ARM64 functions.
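Such a lookup amounts to embedding the query function and ranking the database by similarity. A toy sketch with 4-dimensional stand-in vectors and hypothetical function names (real embeddings are the model's 768-dimensional outputs):

```python
from math import sqrt

def cos_sim(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v)))

# Hypothetical database of known functions and their (toy) embeddings.
database = {
    "memcpy@libfoo": [0.9, 0.1, 0.0, 0.1],
    "strlen@libfoo": [0.0, 0.8, 0.2, 0.1],
    "memcpy@libbar": [0.7, 0.3, 0.1, 0.1],
}

def top_k(query_emb, db, k=2):
    """Return the k database functions most similar to the query embedding."""
    ranked = sorted(db, key=lambda name: cos_sim(query_emb, db[name]), reverse=True)
    return ranked[:k]

query = [0.85, 0.15, 0.05, 0.1]  # embedding of an unknown function
print(top_k(query, database))    # the two memcpy variants rank highest
```

At realistic database sizes one would use an approximate nearest-neighbour index rather than the exhaustive scan shown here.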


### What else could the model be used for?
We do not see other applications for this model.


### To what problems is the model not applicable?
This model has been finetuned on the semantic search task.
For the base ARM64BERT model, please refer to the [other
model](https://huggingface.co/NetherlandsForensicInstitute/ARM64BERT) we have published.


## Data
### What data was used for training and evaluation?
The dataset is created in the same way as Wang et al. created BinaryCorp.
A large set of source code comes from the [ArchLinux official repositories](https://archlinux.org/packages/) and the [ArchLinux user repositories](https://aur.archlinux.org/packages/).
All this code is split into functions that are compiled into binary code with different optimization levels
(`O0`, `O1`, `O2`, `O3` and `Os`) and security settings (fortify or no-fortify).
This results in a maximum of 10 (5×2) different versions of each function which are semantically similar, i.e. they represent the same functionality but have different machine code.
The dataset is split into a train and a test set. This is done at the project level, so all binaries and functions belonging to one project are part of
either the train or the test set, never both. We have not performed any deduplication on the dataset for training.
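The ten variants per function follow from the cross product of the five optimization levels and the two security settings:

```python
from itertools import product

opt_levels = ["O0", "O1", "O2", "O3", "Os"]
security = ["fortify", "no-fortify"]

# Each combination yields a semantically equivalent but differently
# compiled variant of the same source function.
variants = list(product(opt_levels, security))
print(len(variants))  # 5 x 2 = 10 variants
```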


| | set | # functions | |
| |-------|------------:| |
| | train | 18,083,285 | |
| | test | 3,375,741 | |


For our training and evaluation code, see our [GitHub repository](https://github.com/NetherlandsForensicInstitute/asmtransformers).


### By whom was the dataset collected and annotated?
The dataset was collected by our team.


### Any remarks on data quality and bias?
After training our models, we discovered that something had gone wrong when compiling our dataset:
the first line of the next function was erroneously included in the previous one.
This has been fixed for the finetuning data, but given the long training process
and the good performance of the model despite this mistake, we have decided not to retrain the base model.