SentenceTransformer based on sentence-transformers/all-distilroberta-v1

This is a sentence-transformers model finetuned from sentence-transformers/all-distilroberta-v1 on the ai-job-embedding-finetuning dataset. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Type: Sentence Transformer
Base model: sentence-transformers/all-distilroberta-v1
Maximum Sequence Length: 512 tokens
Output Dimensionality: 768 dimensions
Similarity Function: Cosine Similarity
Training Dataset:
- ai-job-embedding-finetuning

Model Sources

Documentation: Sentence Transformers Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'RobertaModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
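
The Pooling and Normalize modules above can be illustrated with a small pure-Python sketch. It assumes mean pooling averages only the non-padded token vectors and that Normalize rescales the result to unit L2 length; the helper names `mean_pool` and `l2_normalize` and the toy 4-dimensional vectors are hypothetical and not part of the library.

```python
import math

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping padded positions (mask == 0)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                summed[i] += value
    count = max(count, 1)  # guard against an all-padding input
    return [value / count for value in summed]

def l2_normalize(vector, eps=1e-12):
    """Rescale a vector to unit L2 length."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / max(norm, eps) for v in vector]

# Toy input: 3 token positions, 4 dimensions, last position is padding.
tokens = [[1.0, 2.0, 0.0, 0.0], [3.0, 0.0, 0.0, 4.0], [9.0, 9.0, 9.0, 9.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)
embedding = l2_normalize(pooled)
length = math.sqrt(sum(v * v for v in embedding))
print(pooled)
print(length)
```

Because the final module normalizes every embedding, cosine similarity between two outputs of this model reduces to a plain dot product.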

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("khengkok/distilroberta-ai-job-embeddings")
# Run inference
queries = [
    "Revenue Operations data analysis, sales forecasting models, sales territory and quota optimization",
]
documents = [
    "skills:\n\nExperience with “Lean Management” and/or “Six Sigma” concepts.Be able to analyze processes/workflows and find opportunities to streamline/improve/eliminate waste.Be able to create value stream maps Experience with Microsoft Viso.Office products (MS Word/MS Excel/Teams) MS Access\n\nMinimum required work experience:\n\nExcellent entry level opportunity!\n\nJob/class description:\n\nExtracts data from multiple systems and departments using various data manipulation and extraction techniques for regular, cyclical, and ad hoc reporting.Performs research, analyzes reports, and creates statistical models for presentation/review. Summarizes findings and communicates results to management.Identifies operational inadequacies and uses various skills and resources to retool processes.Communicates with other areas regarding outcomes and reporting.\n\nRequired knowledge, skills, and abilities:\n\nGood organizational, customer service, communications, and analytical skills.Ability to use complex mathematical calculations and understand mathematical and statistical concepts.Knowledge of relevant computer support systems.Microsoft Office.Ability to acquire programming skills across various software platforms.Good communication verbal/written, good organization, good analysis, customer service, cross team facilitation.\n\nPreferred knowledge, skills, and abilities:\n\nNegotiation or persuasion skills.Ability to acquire or knowledge of ICD9/CPT4 coding.SAS and/or DB2, or other relational database.\n\nWork environment:\n\nTypical office environment. Some travel between buildings and out of town.The team has 11 members, each are diverse individuals whom strive to exceed customer expectations. 
With in the greater team is a smaller team of 3 individuals whom compose the “plan” team.This person would be a part of this sub team.They work as a close-knit group and embrace a team atmosphere.They enjoy having fun while getting the work done\n\nRequired education/equivalencies:\n\nBachelor's degree Statistics, Computer Science, Mathematics, Business, Healthcare, or other related field.OR 2 year degree in Computer Science, Business or related field and 2 years of reporting and data analysis work experienceOR 4 years reporting and data analysis experience.\n\nInterested? Learn more:\n\nClick the apply button or contact our recruiter Kyle at Kyle.Croft@dppit.com to learn more about this position (#24-00288).\n\nDPP offers a range of compensation and benefits packages to our employees and their eligible dependents. Call today to learn more about working with DPP.\n\nUS Citizen: This role requires the ability to obtain a low-level US security clearance, which requires a thorough background search and US citizenship. Residency requirements may apply.",
    'requirements, collect data, lead cleansing efforts, and load/support data into SAPthe gap between business and IT teams, effectively communicating data models and setting clear expectations of deliverablesand maintain trackers to showcase progress and hurdles to Project Managers and Stakeholders\nQualifications\nknowledge of SAP and MDGcommunication skillsto manage multiple high-priority, fast-paced projects with attention to detail and organizationan excellent opportunity to learn an in-demand area of SAP MDGa strong willingness to learn, with unlimited potential for growth and plenty of opportunities to expand skills\nThis role offers a dynamic environment where you can directly impact IT projects and contribute to the company’s success. You will work alongside a supportive team of professionals, with ample opportunities for personal and professional development. \nIf you’re ready to take on new challenges and grow your career in data analytics and SAP, apply now and be part of our journey toward excellence.',
    "experience with a minimum of 0+ years of experience in a Computer Science or Data Management related fieldTrack record of implementing software engineering best practices for multiple use cases.Experience of automation of the entire machine learning model lifecycle.Experience with optimization of distributed training of machine learning models.Use of Kubernetes and implementation of machine learning tools in that context.Experience partnering and/or collaborating with teams that have different competences.The role holder will possess a blend of design skills needed for Agile data development projects.Proficiency or passion for learning, in data engineer techniques and testing methodologies and Postgraduate degree in data related field of study will also help. \n\n\nDesirable for the role\n\n\nExperience with DevOps or DataOps concepts, preferably hands-on experience implementing continuous integration or highly automated end-to-end environments.Interest in machine learning will also be advantageous.Experience implementing a microservices architecture.Demonstrate initiative, strong customer orientation, and cross-cultural working.Strong communication and interpersonal skills.Prior significant experience working in Pharmaceutical or Healthcare industry environment.Experience of applying policies, procedures, and guidelines.\n\n\nWhy AstraZeneca?\n\nWe follow all applicable laws and regulations on non-discrimination in employment (and recruitment), as well as work authorization and employment eligibility verification requirements. We will ensure that individuals with disabilities are provided reasonable accommodation to participate in the job application or interview process, to perform essential job functions, and to receive other benefits and privileges of employment.\n\nWhen we put unexpected teams in the same room, we unleash bold thinking with the power to inspire life-changing medicines. 
In-person working gives us the platform we need to connect, work at pace and challenge perceptions. That’s why we work, on average, a minimum of three days per week from the office. But that doesn't mean we’re not flexible. We balance the expectation of being in the office while respecting individual flexibility. Join us in our unique and ambitious world.\n\nCompetitive Salary & Benefits\n\nClose date: 10/05/2024\n\nSo, what’s next! \n\n\nAre you already imagining yourself joining our team? Good, because we can’t wait to hear from you. Don't delay, apply today!\n\n\nWhere can I find out more?\n\nOur Social Media, Follow AstraZeneca on LinkedIn: https://www.linkedin.com/company/1603/\n\nInclusion & Diversity: https://careers.astrazeneca.com/inclusion-diversity\n\nCareer Site: https://careers.astrazeneca.com/",
]
query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# (1, 768) (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[0.2975, 0.3023, 0.2415]])
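
For semantic search, the similarity scores are typically turned into a ranking. The sketch below is a minimal pure-Python illustration: `rank_by_score` and `cosine_similarity` are hypothetical helpers, the scores are copied from the example output above, and the two-dimensional vectors are made up.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_score(scores):
    """Return document indices ordered from most to least similar."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Scores copied from the example output above (one query, three documents).
scores = [0.2975, 0.3023, 0.2415]
ranking = rank_by_score(scores)
print(ranking)  # [1, 0, 2]

# Made-up unit-length vectors: cosine similarity equals the dot product.
u = [0.6, 0.8]
v = [1.0, 0.0]
print(round(cosine_similarity(u, v), 4))  # 0.6
```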

Evaluation

Metrics

Triplet

Datasets: ai-job-validation and ai-job-test
Evaluated with TripletEvaluator

Metric	ai-job-validation	ai-job-test
cosine_accuracy	0.4545	0.6
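
As a rough guide to reading cosine_accuracy, the sketch below assumes a triplet counts as correct when the query is strictly more similar to the positive job description than to the negative one (a margin of 0). The function name and the similarity values are invented for illustration and are not taken from the evaluation data.

```python
def triplet_accuracy(sim_pos, sim_neg):
    """Fraction of triplets where sim(query, positive) > sim(query, negative)."""
    correct = sum(1 for p, n in zip(sim_pos, sim_neg) if p > n)
    return correct / len(sim_pos)

# Made-up cosine similarities for five (query, positive, negative) triplets.
sim_pos = [0.62, 0.31, 0.48, 0.20, 0.55]
sim_neg = [0.40, 0.35, 0.30, 0.25, 0.10]

accuracy = triplet_accuracy(sim_pos, sim_neg)
print(accuracy)  # 0.6
```

Under this reading, a score near 0.5 means the model ranks the positive above the negative about as often as a coin flip would.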

Training Details

Training Dataset

ai-job-embedding-finetuning

Dataset: ai-job-embedding-finetuning at f10e8c6
Size: 269 training samples
Columns: query, job_description_pos, and job_description_neg

Approximate statistics based on the first 269 samples:

	query	job_description_pos	job_description_neg
type	string	string	string
details	min: 8 tokens mean: 14.81 tokens max: 26 tokens	min: 14 tokens mean: 330.53 tokens max: 512 tokens	min: 10 tokens mean: 333.68 tokens max: 512 tokens

Samples:

query	job_description_pos	job_description_neg
`Orlando data analyst SQL query optimization advanced dashboard development utilities industry insights`	skills in a global environment. Finally, you will interact with other members of our United States Health and Benefits team and can make important contributions to process improvements and new analytical tools. This position requires an analytical mind who is detail oriented with work product and outputs using Microsoft Office tools. The position also requires the ability to accurately execute written and verbal instructions. The Role Manage NQTL Operational Data Portion Of Parity Assessment, Including Prepare NQTL carrier operational data requests on behalf of each client/carrierCoordinate with Project Manager regarding sending requests, timing, status, and follow-upAttend internal and client kick off meeting with QTL/NQTL team Monitor carrier and vendor responsiveness to data requestsValidate completeness of response and report any issues or impact to timeline proactively to Project ManagerComplete initial review of carrier responses for parity projectsMap carrier responses to ap...	
skills:Proficiency in Python programming languageKnowledge of natural language processing (NLP), data science, and deep learning algorithms (RNN, CNN, etc.)Ability to implement machine learning algorithms and statistical analysisStrong presentation and teaching skills to articulate complex concepts to non-technical audiencesUnderstanding of data structures and algorithms in PythonExcellent research skills, utilizing papers, textbooks, online resources, and GitHub repositoriesPotential involvement in writing and publishing academic papers Qualifications2nd or 3rd-year undergraduate student in computer science or statisticsRequired experience: candidates must have completed at least three of the following courses: Statistics, Machine Learning, Deep Learning, AI, and Data Structures and Algorithms.GPA of 3.5 or higher.Ability to work independently and collaborativelyExcellent problem-solving and analytical skillsStrong written and verbal communication skills Relevant coursework projects o...
`Clarity PPM data analysis, project portfolio reporting, resource capacity planning`	requirements into an efficient process and/or system solution? If so, DHL Supply Chain has the opportunity for you. Job DescriptionTo apply knowledge and analytics to develop and communicate timely, accurate, and actionable insight to the business through the use of modeling, visualization, and optimization. Responsible for the reporting, analyzing, and predicting of operational processes, performance, and Key Performance Indicators. Communication with site leadership, operations, and finance on efficiency, customer requirements, account specific issues, and insight into to the business, operations, and customer. Applies hindsight, insight, and foresight techniques to communicate complex findings and recommendations to influence others to take actionUses knowledge of business and data structure to discover and/or anticipate problems where data can be used to solve the problemUses spreadsheets, databases, and relevant software to provide ongoing analysis of operational activitiesApplies...	Qualifications) Bachelor's degree in a relevant field such as mathematics, statistics, or computer science Minimum of 5 years of experience as a data analyst or similar role Proficiency in SQL, Python, and data visualization tools Strong analytical and problem-solving skills Excellent written and verbal communication skills How To Stand Out (Preferred Qualifications) Master's degree in a relevant field Experience with machine learning and predictive modeling Knowledge of cloud-based data platforms such as AWS or Google Cloud Familiarity with Agile methodologies and project management tools Strong attention to detail and ability to work independently #RecruitingSoftware #DataAnalysis #RemoteWork #CareerOpportunity #CompetitivePay At Talentify, we prioritize candidate privacy and champion equal-opportunity employment. 
Central to our mission is our partnership with companies that share this commitment. We aim to foster a fair, transparent, and secure hiring environment for all. If ...
`AAA game AI engineer pathfinding vehicle navigation`	skills and knowledge in a supportive and empowering environment. Technology StackWe utilize the Google Cloud Platform, Python, SQL, BigQuery, and Looker Studio for data analysis and management.We ingest data from a variety of third-party tools, each providing unique insights.Our stack includes DBT and Fivetran for efficient data integration and transformation. Key ResponsibilitiesCollaborate with teams to understand data needs and deliver tailored solutions.Analyze large sets of structured and unstructured data to identify trends and insights.Develop and maintain databases and data systems for improved data quality and accessibility.Create clear and effective data visualizations for stakeholders.Stay updated with the latest trends in data analysis and technologies. Qualifications and Skills2-3 years of hands-on experience in data.You can distill complex data into easy to read and interpret dashboards to enable leadership / business teams to gather data insights and monitor KPIs.Solid u...	Qualifications: Good communication verbal/written, Good organization, Good analysis, Customer service, cross team facilitation.Experience with “Lean Management” and/or “Six Sigma” concepts.Be able to analyze processes/workflows and find opportunities to streamline/improve/eliminate waste.Be able to create value stream maps.Experience with Microsoft Visio.Office products (MS Word/MS Excel/Teams) MS AccessBachelors degree Statistics, Computer Science, Mathematics, Business, Healthcare, or other related field. or 2 year degree in Computer Science, Business or related field and 2 years of reporting and data analysis work experience OR 4 years reporting and data analysis experience.

Loss: MultipleNegativesRankingLoss with these parameters:

{
    "scale": 20.0,
    "similarity_fct": "cos_sim",
    "gather_across_devices": false
}
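
To make the role of `scale` and `cos_sim` concrete, here is a minimal pure-Python sketch of a multiple-negatives ranking objective. It assumes each query is scored against every positive in the batch plus any supplied hard negatives, and that the loss is the cross-entropy of picking the query's own positive. The function names and toy vectors are hypothetical; this is not the library implementation.

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mnr_loss(anchors, positives, negatives=None, scale=20.0):
    """Mean cross-entropy where anchor i should select candidate i."""
    candidates = positives + (negatives or [])
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, c) for c in candidates]
        top = max(logits)  # subtract the max for numerical stability
        log_sum_exp = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Toy batch of two queries with two-dimensional embeddings.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # each positive matches its query
swapped = [[0.0, 1.0], [1.0, 0.0]]   # positives assigned to the wrong query

loss_aligned = mnr_loss(anchors, aligned)
loss_swapped = mnr_loss(anchors, swapped)
loss_hard_neg = mnr_loss(anchors, aligned, negatives=[[1.0, 0.0]])
print(loss_aligned, loss_swapped, loss_hard_neg)
```

With `scale` set to 20.0, small differences in cosine similarity become large differences in the logits, so a hard negative that sits close to the query raises the loss noticeably.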

Evaluation Dataset

ai-job-embedding-finetuning

Dataset: ai-job-embedding-finetuning at f10e8c6
Size: 33 evaluation samples
Columns: query, job_description_pos, and job_description_neg

Approximate statistics based on the first 33 samples:

	query	job_description_pos	job_description_neg
type	string	string	string
details	min: 10 tokens mean: 15.36 tokens max: 33 tokens	min: 21 tokens mean: 329.55 tokens max: 512 tokens	min: 31 tokens mean: 321.73 tokens max: 512 tokens

Samples:

query	job_description_pos	job_description_neg
`Power BI dashboards, SQL data transformation, Databricks analytics consultant`	experience desired Extensive experience with database and SQL tools including MS SQL, Tableau, Visual BASIC, and EXCEL Ability to work with counterparts in the organization with varying levels of technical expertise, including Marketing, Product, and IT personnel Ability to work independently and efficiently on a high volume of tasks Stay updated with emerging trends and best practices in data visualization and analytics to continuously improve reporting capabilities Why Work For Us 4 weeks accrued paid time off + 9 paid national holidays per year Tuition Reimbursement Low cost and excellent coverage health insurance options (medical, dental, vision) Gym membership reimbursement Robust health and wellness program and fitness reimbursements Auto and home insurance discounts Matching gift opportunities Annual 401(k) Employer Contribution (up to 7.5% of your base salary) Various Paid Family leave options including Paid Parental Leave $3,000 one-time bonus payment on healt...	`Qualifications:Relevant educational qualification or degree in Data analytics or Data Science or Statistics or Applied Mathematics or equivalent qualification. (Required)Experience with Tableau.(Optional)Familiar with Python, Big Data. (Optional)Proficient in SQL.Candidates who are missing the required skills, might be provided an option to enhance their skills, so that they can also apply for the role and can make a career in the IT industry.Freshers can also apply`
`product analyst SQL data migration Agile user stories`	requirements, developing reporting, and enabling efficiencies. You will also encourage analytics independence as a subject matter expert and champion of business intelligence software (e.g. Power BI, Tableau, etc.). The group also leads the Accounting Department’s Robotic Process Automation efforts. Kiewit is known as an organization that encourages high performers to challenge themselves by operating in roles they may not be classically trained for. This position embodies this spirit as the experiences will lend themselves nicely into several potential paths including accounting roles / leadership, operations management, data analysis roles and technology group positions. District Overview At Kiewit, the scale of our operations is huge. Our construction and engineering projects span across the United States, Canada and Mexico, improving and connecting communities with every initiative. We depend on our high-performing operations support professionals — they’re the glue that holds m...	QualificationsBachelor's degree in Computer Science, Statistics, Mathematics, Economics, or related field. At least five years of experience as a Data Analyst in a digital media or ecommerce setting.Proficiency in SQL, Python, R, or other programming languages for data manipulation and analysis.Experience with Google Data Studio or other data visualization tools.Experience creating custom data pipelines, automated reports, and data visualizations.Expertise in web and mobile analytics platforms (e.g. Google Analytics, Adobe Analytics, AppsFlyer, Amplitude).Current understanding of internet consumer data privacy matters.Excellent communication and collaboration skills, with the ability to present findings and recommendations to both technical and non-technical stakeholders.Strong analytical skills and attention to detail, with the ability to translate complex data into actionable insights. 
Preferred QualificationsExperience with video delivery systems (encoding platforms, video players,...
`healthcare claims data analysis, complex SQL optimization, ETL process support`	requirements, and integrated management systems for our countries civilian agencies (FAA, FDIC, HOR, etc.).Our primary mission is to best serve the needs of our clients by solutioning with our stakeholder teams to ensure that the goals and objectives of our customers are proactively solutioned, such that opportunities to invest our time in developing long-term solutions and assets are abundant and move our clients forward efficiently.At DEVIS, we are enthusiastic about our research, our work and embracing an environment where all are supported in the mission, while maintaining a healthy work-life balance. We are currently seeking a Data Analyst to join one of our Department of State programs. The candidate would support the Bureau of Population, Refugees, and Migration (PRM) Refugee Processing Center (RPC) in Rosslyn, VA. The ideal candidate must be well-versed in ETL services and adept at gathering business requirements from diverse stakeholders, assessing the pros/cons of ETL tools, ...	experience in data analysis, preferably in a data warehouse environment.Strong proficiency in SQL and experience with data modeling and mapping.Familiarity with star schema design and data warehousing concepts.Excellent analytical and problem-solving skills.Strong communication and interpersonal skills, with the ability to explain complex data concepts to non-technical stakeholders.Ability to manage multiple projects and meet deadlines in a fast-paced environment.Experience with data visualization tools (e.g., Tableau) is a plus. 
Required Soft Skills:Good analytical and problem-solving skillsExceptional communication skills (written and verbal)Good documentation skillsProficiency in English language (as a medium of communication)Frank and open communication with peers and higher-ups about realistic estimations and meeting timelines/expectations and proactive communication of issues and concerns thereof.Nice to have:Dimensional Modeling using Star SchemaKnowledge about ETL tools and how...

Loss: MultipleNegativesRankingLoss with these parameters:

{
    "scale": 20.0,
    "similarity_fct": "cos_sim",
    "gather_across_devices": false
}

Training Hyperparameters

Non-Default Hyperparameters

eval_strategy: steps
per_device_train_batch_size: 16
per_device_eval_batch_size: 16
learning_rate: 2e-05
num_train_epochs: 1
warmup_ratio: 0.1
batch_sampler: no_duplicates
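
These settings imply a very short run. The sketch below estimates the step count and the learning rate at each step, assuming a single device, no gradient accumulation, one optimizer step per batch of 16 drawn from the 269 training samples, warmup steps rounded up from `warmup_ratio`, and a linear warmup followed by linear decay to zero. The helper names are hypothetical, and the `no_duplicates` batch sampler may yield a slightly different number of batches in practice.

```python
import math

def total_steps(num_samples, batch_size, epochs=1):
    """Optimizer steps when the last partial batch is kept."""
    return math.ceil(num_samples / batch_size) * epochs

def linear_schedule_lr(step, base_lr, warmup_steps, num_steps):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = num_steps - step
    return base_lr * max(0.0, remaining / max(1, num_steps - warmup_steps))

num_steps = total_steps(num_samples=269, batch_size=16)  # 17
warmup_steps = math.ceil(0.1 * num_steps)                # 2

for step in (0, 1, 2, 10, num_steps):
    print(step, linear_schedule_lr(step, 2e-05, warmup_steps, num_steps))
```

Under these assumptions the model sees only about 17 updates, two of them during warmup.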

All Hyperparameters

overwrite_output_dir: False
do_predict: False
eval_strategy: steps
prediction_loss_only: True
per_device_train_batch_size: 16
per_device_eval_batch_size: 16
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 1
eval_accumulation_steps: None
torch_empty_cache_steps: None
learning_rate: 2e-05
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1.0
num_train_epochs: 1
max_steps: -1
lr_scheduler_type: linear
lr_scheduler_kwargs: {}
warmup_ratio: 0.1
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 42
data_seed: None
jit_mode_eval: False
bf16: False
fp16: False
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: False
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
parallelism_config: None
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch_fused
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
project: huggingface
trackio_space_id: trackio
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: None
hub_always_push: False
hub_revision: None
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
include_for_metrics: []
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
include_tokens_per_second: False
include_num_input_tokens_seen: no
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
eval_on_start: False
use_liger_kernel: False
liger_kernel_config: None
eval_use_gather_object: False
average_tokens_across_devices: True
prompts: None
batch_sampler: no_duplicates
multi_dataset_batch_sampler: proportional
router_mapping: {}
learning_rate_mapping: {}

Training Logs

Epoch	Step	ai-job-validation_cosine_accuracy	ai-job-test_cosine_accuracy
-1	-1	0.4545	0.6000

Framework Versions

Python: 3.12.12
Sentence Transformers: 5.1.2
Transformers: 4.57.1
PyTorch: 2.9.0+cu126
Accelerate: 1.11.0
Datasets: 4.0.0
Tokenizers: 0.22.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Safetensors

Model size: 82.1M params
Tensor type: F32
