This repo contains the source code for the paper *On the Robustness of ChatGPT: An Adversarial and Out-of-distribution Perspective*.
This project evaluates the robustness of ChatGPT and several foundation language models. The results are summarized below.
First, clone the repo and enter the project directory:

```bash
git clone https://github.com/microsoft/robustlearn.git
cd robustlearn/chatgpt-robust
```
Then, install the required dependencies:

```bash
pip install transformers pandas nltk jieba
```

Alternatively, you can create a conda virtual environment:

```bash
conda env create -f environment.yml
```
Everything runs through main.py:
For classification tasks:

```bash
python main.py --dataset advglue --task sst2 --service hug --model xxx
python main.py --dataset advglue --task sst2 --service gpt --model text-davinci-003
```
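For a sense of what the Hugging Face path (`--service hug`) does under the hood, here is a minimal, hypothetical sketch of classification with a `transformers` pipeline; the model checkpoint is an assumption for illustration, not the repo's actual configuration:

```python
# Hypothetical sketch of an SST-2-style prediction via Hugging Face.
# The checkpoint below is an illustrative assumption.
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

sentence = "the film is a masterful achievement"  # an SST-2-style input
result = classifier(sentence)[0]
print(result["label"], round(result["score"], 3))  # e.g. POSITIVE 0.999
```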
For translation tasks:

```bash
python main.py --dataset advglue-t --task translation_en_to_zh --service hug --model xxx
```
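Similarly, a hypothetical sketch of an en→zh translation call, using the Helsinki-NLP/opus-mt-en-zh checkpoint evaluated below (an illustration, not the repo's code):

```python
# Hypothetical sketch of an en->zh translation with a pipeline.
from transformers import pipeline

translator = pipeline("translation_en_to_zh", model="Helsinki-NLP/opus-mt-en-zh")
print(translator("The weather is nice today.")[0]["translation_text"])
```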
Note that simply running the code will not give you the final results, since the outputs of generative models are not stable; some manual processing is needed. Bad cases of AdvGLUE and Flipkart are provided in this folder.
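As an illustration of the kind of cleanup involved, a hypothetical normalizer for noisy SST-2-style outputs might look like this (the actual processing in the paper involved human review):

```python
# Hypothetical post-processing: map a noisy generative output to a label.
def normalize_sst2(output: str) -> str:
    text = output.strip().lower()
    if "positive" in text:
        return "positive"
    if "negative" in text:
        return "negative"
    return "unknown"  # leave for manual inspection

print(normalize_sst2(" The sentiment is Positive. "))  # -> positive
```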
Here is a summary of the results.

For the adversarial tasks, the metric is attack success rate (ASR); lower is better. A computation sketch follows the table.
Model | SST-2 | QQP | MNLI | QNLI | RTE | ANLI |
---|---|---|---|---|---|---|
Random | 50.0 | 50.0 | 66.7 | 50.0 | 50.0 | 66.7 |
DeBERTa-L (435 M) | 66.9 | 39.7 | 64.5 | 46.6 | 60.5 | 69.3 |
BART-L (407 M) | 56.1 | 62.8 | 58.7 | 52.0 | 56.8 | 57.7 |
GPT-J (6 B) | 48.7 | 59.0 | 73.6 | 50.0 | 56.8 | 66.5 |
T5 (11 B) | 40.5 | 59.0 | 48.8 | 49.7 | 56.8 | 68.6 |
T0 (11 B) | 36.5 | 60.3 | 72.7 | 49.7 | 56.8 | 77.2 |
GPT-NeoX (20 B) | 52.7 | 56.4 | 59.5 | 54.0 | 48.1 | 70.0 |
OPT (66 B) | 47.6 | 53.9 | 60.3 | 52.7 | 58.0 | 58.3 |
BLOOM (176 B) | 48.7 | 59.0 | 73.6 | 49.7 | 56.8 | 66.5 |
text-davinci-002 (175 B) | 46.0 | 28.2 | 54.6 | 45.3 | 35.8 | 68.8 |
text-davinci-003 (175 B) | 44.6 | 55.1 | 44.6 | 38.5 | 34.6 | 62.9 |
ChatGPT (175 B) | 39.9 | 18.0 | 32.2 | 34.5 | 24.7 | 55.3 |
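ASR can be read as the fraction of adversarial examples a model misclassifies, which is consistent with the Random baseline sitting at ~50% on binary tasks; a minimal sketch, assuming that reading:

```python
# Sketch: attack success rate as the percentage of adversarial
# examples the model gets wrong (this reading is an assumption).
def attack_success_rate(predictions, gold_labels):
    wrong = sum(p != g for p, g in zip(predictions, gold_labels))
    return 100.0 * wrong / len(gold_labels)

print(attack_success_rate(["pos", "neg", "neg"], ["pos", "pos", "neg"]))  # 33.33...
```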
For translation, the metrics are BLEU, GLEU, and METEOR; higher is better.
Translation | BLEU | GLEU | METEOR |
---|---|---|---|
Helsinki-NLP/opus-mt-en-zh | 18.11 | 26.78 | 46.38 |
liam168/trans-opus-mt-en-zh | 15.23 | 24.89 | 45.02 |
text-davinci-002 | 24.97 | 36.30 | 59.28 |
text-davinci-003 | 30.60 | 40.01 | 61.88 |
ChatGPT | 26.27 | 37.29 | 58.95 |
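Since `nltk` and `jieba` are among the dependencies, sentence-level scores can presumably be computed along these lines (a sketch; the repo's exact scoring code and its METEOR setup may differ):

```python
# Sketch: BLEU/GLEU for an en->zh hypothesis, tokenized with jieba.
import jieba
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.gleu_score import sentence_gleu

reference = list(jieba.cut("今天天气很好"))
hypothesis = list(jieba.cut("今天的天气不错"))

print("BLEU:", sentence_bleu([reference], hypothesis))
print("GLEU:", sentence_gleu([reference], hypothesis))
```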
For out-of-distribution classification on Flipkart and DDXPlus, the metric is F1 score; higher is better.
Model | Flipkart | DDXPlus |
---|---|---|
Random | 20.0 | 4.0 |
DeBERTa-L (435 M) | 60.6 | 4.5 |
BART-L (407 M) | 57.8 | 5.3 |
GPT-J (6 B) | 28.0 | 2.4 |
T5 (11 B) | 58.8 | 6.3 |
T0 (11 B) | 58.3 | 8.4 |
GPT-NeoX (20 B) | 39.4 | 12.3 |
OPT (66 B) | 44.5 | 0.3 |
BLOOM (176 B) | 28.0 | 0.1 |
text-davinci-002 (175 B) | 57.5 | 18.9 |
text-davinci-003 (175 B) | 57.3 | 19.6 |
ChatGPT (175 B) | 60.6 | 20.2 |
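For reference, per-class F1 can be computed as below (pure Python; the averaging scheme used for the multi-class tasks is an assumption here):

```python
# Sketch: F1 for a single class; macro-F1 would average this over classes.
def f1_for_class(preds, golds, cls):
    tp = sum(p == cls and g == cls for p, g in zip(preds, golds))
    fp = sum(p == cls and g != cls for p, g in zip(preds, golds))
    fn = sum(p != cls and g == cls for p, g in zip(preds, golds))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(f1_for_class(["pos", "pos", "neg"], ["pos", "neg", "neg"], "pos"))  # 0.667
```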
If you find this work useful, please cite our paper:

```
@article{wang2023robustness,
  title={On the Robustness of ChatGPT: An Adversarial and Out-of-distribution Perspective},
  author={Wang, Jindong and Hu, Xixu and Hou, Wenxin and Chen, Hao and Zheng, Runkai and Wang, Yidong and Yang, Linyi and Huang, Haojun and Ye, Wei and Geng, Xiubo and Jiao, Binxing and Zhang, Yue and Xie, Xing},
  journal={arXiv preprint arXiv:2302.12095},
  year={2023}
}
```
Note that the results of generative models may vary due to the stochastic nature of generation, so the numbers reproduced with this repo may also differ. In addition, the outputs of generative foundation models are not always clean, so you will need to manually process some noisy outputs to get what you want. This code is best used for demos and practice, since it makes it easy to collect outputs from multiple models; human evaluation and post-processing remain important.