Exam Question Classification

This repository contains the experimental code for the paper Test Case Classification via Few-Shot Learning. It is intended for academic use only.

Method: In the paper, we propose a test case classification approach based on few-shot learning and test case augmentation to address the limitations discussed there. The approach generates new test cases with a large pre-trained masked language model and learns embedding representations by training word embedding models. A BiLSTM-based classifier then classifies test cases by extracting in-depth features. In addition, an attention mechanism assigns high weights to words that indicate the test case category, identified by lexicon matching.
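
The last step, attention guided by lexicon matching, can be illustrated with a small sketch. This README does not say how the lexicon enters the attention computation, so the additive `boost` term, the function name `lexicon_attention`, and the toy scores below are assumptions for illustration, not the paper's implementation.

```python
import math


def lexicon_attention(tokens, scores, lexicon, boost=2.0):
    """Return softmax attention weights, favoring tokens found in `lexicon`.

    tokens:  words of one test case description
    scores:  raw (unnormalized) attention scores, one per token
    lexicon: set of category-indicative words
    boost:   hypothetical additive bonus for lexicon words (an assumption)
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    adjusted = [s + boost if t in lexicon else s for t, s in zip(tokens, scores)]
    # Numerically stable softmax over the adjusted scores
    peak = max(adjusted)
    exps = [math.exp(a - peak) for a in adjusted]
    total = sum(exps)
    return [e / total for e in exps]


tokens = ["app", "crashes", "after", "login"]
scores = [0.1, 0.1, 0.1, 0.1]  # equal raw scores, so only the lexicon matters
weights = lexicon_attention(tokens, scores, lexicon={"crashes"})
for token, weight in zip(tokens, weights):
    print(f"{token}: {weight:.3f}")
```

With equal raw scores, the only lexicon word ("crashes") ends up with the largest weight, and the weights still sum to 1.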

Dependency versions

bert4keras==0.11.4
Flask==1.0.2
nlpcda==2.5.8
numpy==1.15.1
pyltp==0.4.0
pymysql==1.0.2
scikit_learn==1.2.1
torch==1.13.0
xlrd==1.1.0

These match requirements.txt.

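Project structure

  • classify_model: classification model for the dataset
  • classify_service: classification functionality
    • chinese_roformer-sim-char: SimBERTv2 model
    • chinese_simbert: SimBERT model
    • ltp_data: models for the HIT Language Technology Platform (LTP)
    • word_list_data: training and test sets
    • splited_data: development, training, and test sets
    • bilstm_attention.py: main BiLSTM training script
    • contrast_experiment.py: prints the results of the classic classification models
    • data_processor.py: data processing utilities
  • word2index: word vector data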

Model training

  1. Adjust the parameters

Modify the global variables in bilstm_attention.py:

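   import torch

   vocab_size = 5000  # vocabulary size
   embedding_size = 64  # word vector dimension
   num_classes = 6  # six-way classification (TODO)
   sentence_max_len = 64  # maximum length of a single sentence
   hidden_size = 16

   num_layers = 1  # single LSTM layer
   num_directions = 2  # bidirectional LSTM
   lr = 1e-3
   batch_size = 16  # batch size
   epochs = 50

   device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

   app_names = ["航天中认自主可控众包测试练习赛"]
   # random_state for the 航天 dataset: 6
   # random_state for the 趣享 dataset: 13
   # second app name, currently disabled: "决赛自主可控众测web自主可控运维管理系统"
   # Labels: abnormal exit, incomplete functionality, user experience,
   # page layout defect, performance, security
   bug_type = ["不正常退出", "功能不完整", "用户体验", "页面布局缺陷", "性能", "安全"]
   lexicon = {0: [], 1: [], 2: [], 3: [], 4: [], 5: []}
   word_with_attention = {}
   n = 5  # select the n samples with the highest confidence
   m = 3  # select the m words with the highest attention weights

   t1 = 3
   t2 = 8
   threshold_confidence = 0.9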
  2. Run model training

    python bilstm_attention.py
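
The comments on `n`, `m`, and `threshold_confidence` suggest a selection step: keep the most confident predictions and collect their highest-attention words into `lexicon`. A minimal sketch of that idea follows. The function name `expand_lexicon`, the layout of a prediction record, and the way the three parameters are combined are assumptions. The roles of `t1` and `t2` are not documented here, so they are left out.

```python
def expand_lexicon(predictions, lexicon, n=5, m=3, threshold_confidence=0.9):
    """Add high-attention words from confident predictions to `lexicon`.

    predictions: list of dicts with a hypothetical layout:
        "label"      -> predicted class index
        "confidence" -> probability of the predicted class
        "attention"  -> {word: attention weight} for that sample
    lexicon: {class index: list of words}, updated in place and returned
    """
    confident = [p for p in predictions if p["confidence"] >= threshold_confidence]
    # Keep the n most confident samples
    confident.sort(key=lambda p: p["confidence"], reverse=True)
    for pred in confident[:n]:
        # Keep the m words with the highest attention weight in this sample
        ranked = sorted(pred["attention"].items(), key=lambda kv: kv[1], reverse=True)
        for word, _ in ranked[:m]:
            if word not in lexicon[pred["label"]]:
                lexicon[pred["label"]].append(word)
    return lexicon


lexicon = {i: [] for i in range(6)}
predictions = [
    {"label": 0, "confidence": 0.97,
     "attention": {"crash": 0.6, "exit": 0.3, "the": 0.06, "app": 0.04}},
    {"label": 4, "confidence": 0.55,  # below the threshold, so ignored
     "attention": {"slow": 0.8, "page": 0.2}},
]
expand_lexicon(predictions, lexicon)
print(lexicon[0])
print(lexicon[4])
```

Only the first prediction clears the 0.9 threshold, so class 0 gains its three highest-attention words and class 4 stays empty.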
    

Comparison experiments

from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn import svm
from sklearn.naive_bayes import GaussianNB
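Hyperparameters of the baseline classifiers. Each instantiation is followed by the constructor signature from scikit-learn, for reference.
  1. k-nearest neighbors classifier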
knn_classifier = KNeighborsClassifier()
def __init__(
        self,
        n_neighbors=5,
        *,
        weights="uniform",
        algorithm="auto",
        leaf_size=30,
        p=2,
        metric="minkowski",
        metric_params=None,
        n_jobs=None,
    )
  2. SVM classifier
svm_classifier = svm.SVC(C=2, kernel='rbf', gamma=10, decision_function_shape='ovr')
def __init__(
        self,
        *,
        C=1.0,
        kernel="rbf",
        degree=3,
        coef0=0.0,
        shrinking=True,
        probability=False,
        tol=1e-3,
        cache_size=200,
        class_weight=None,
        verbose=False,
        max_iter=-1,
        decision_function_shape="ovr",
        break_ties=False,
        random_state=None,
    )
  3. Naive Bayes classifier (Gaussian)
muNB_classifier = GaussianNB()
def __init__(self, *, priors=None, var_smoothing=1e-9)
  4. BPNN (back-propagation neural network) classifier
bpnn_classifier = MLPClassifier(solver='lbfgs', random_state=0, hidden_layer_sizes=[10, 10])
def __init__(
        self,

        activation="relu",
        *,
        alpha=0.0001,
        batch_size="auto",
        learning_rate="constant",
        learning_rate_init=0.001,
        power_t=0.5,
        max_iter=200,
        shuffle=True,
        tol=1e-4,
        verbose=False,
        warm_start=False,
        momentum=0.9,
        nesterovs_momentum=True,
        early_stopping=False,
        validation_fraction=0.1,
        beta_1=0.9,
        beta_2=0.999,
        epsilon=1e-8,
        n_iter_no_change=10,
        max_fun=15000,
    )
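
Among the settings above, the SVM uses `kernel='rbf'` with `gamma=10`. The RBF kernel is exp(-gamma * ||x - y||^2), so a larger `gamma` makes similarity fall off faster with distance. The sketch below evaluates that formula in plain Python for two hand-picked points. The helper name `rbf_kernel` and the sample vectors are illustrative assumptions, and the code does not call scikit-learn.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel value exp(-gamma * ||x - y||^2) for two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


x = [0.0, 0.0]
y = [0.3, 0.4]  # Euclidean distance 0.5 from x
for gamma in (0.1, 1.0, 10.0):
    print(f"gamma={gamma}: k(x, y)={rbf_kernel(x, y, gamma):.4f}")
```

With `gamma=10`, two points at distance 0.5 are already far less similar than with smaller values, so the decision function is driven mostly by nearby training samples. That makes the scale of the input features important when reusing this setting.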