Social-Issue-Aware Smishing Detector

This model is a TF-IDF + Logistic Regression ensemble classifier built to determine whether a Korean SMS text message is smishing (a phishing scam) or a legitimate message.

📖 Training Dataset Source

  • The model was trained and validated on meal-bbang/Korean_message, an open dataset from the Hugging Face community (16,309 messages in total).

📌 Key Features

  • No morphological analyzer required: char_wb n-grams let the model learn Korean particle and stem patterns without installing MeCab, KoNLPy, or similar tools. The model is lightweight enough to run inference on several thousand messages per second or more.
  • Special symbol handling (full-width character normalization): the NFKC mode of unicodedata.normalize converts full-width characters, which banks occasionally send and attackers use as a trick to evade filtering, into their standard-width forms.
  • URL token replacement: every web address in the text (http/www) is replaced via a regular expression with the special token __URL__ before training, so that the classifier can focus on the presence of a URL as such.

📊 Model Performance

See metrics.json for detailed figures.

  • AUC-ROC: 0.9995
  • F1 Score: 0.9981
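For readers unfamiliar with the two metrics, the sketch below shows how each is defined, using only the standard library. The labels and scores are made-up toy values, not results from this model, and the 0.5 decision threshold is an assumption; the model card does not state which threshold or data split produced the figures above. In practice, `sklearn.metrics.roc_auc_score` and `sklearn.metrics.f1_score` compute the same quantities.

```python
def f1(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc_roc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one example of each class")
    # Ties between a positive and a negative count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data for illustration only (1: smishing, 0: legitimate).
y_true = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print(auc_roc(y_true, scores))  # 0.75
print(f1(y_true, y_pred))       # 0.5
```

AUC-ROC is computed from the raw scores and does not depend on a threshold, while F1 depends on the threshold used to turn scores into labels.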

🚀 Usage (Python)

import joblib
from huggingface_hub import hf_hub_download

# ํ—ˆ๋ธŒ์—์„œ ํŒŒ์ผ ๋‹ค์šด๋กœ๋“œ
model_path = hf_hub_download(repo_id="Hyeonseo/ko-smishing-detector", filename="pipeline.pkl")
pipeline = joblib.load(model_path)

# ์ถ”๋ก  ํ…Œ์ŠคํŠธ (0: ์ •์ƒ, 1: ์Šค๋ฏธ์‹ฑ)
texts = [
    "[Web๋ฐœ์‹ ] ์•ˆ๋…•ํ•˜์„ธ์š”, ์žฌ๋‚œ์ง€์›๊ธˆ ์‹ ์ฒญ ์•ˆ๋‚ด์ž…๋‹ˆ๋‹ค. http://bit.ly/fakeurl",
    "๋Œ€๋ฆฌ๋‹˜ ๋‚ด์ผ ์˜คํ›„ 3์‹œ ํšŒ์˜ ์ž๋ฃŒ ์ฒจ๋ถ€ํ•ฉ๋‹ˆ๋‹ค."
]
probas = pipeline.predict_proba(texts)[:, 1]

for txt, score in zip(texts, probas):
    print(f"์Šค๋ฏธ์‹ฑ ํ™•๋ฅ  {score:.2%} : {txt}")