Introduction

In this work, we define mixtext, a form of mixed text involving both AI and human-generated content. Then, we introduce **MixSet**, the first dataset dedicated to studying these mixtext scenarios. Leveraging **MixSet**, we executed comprehensive experiments to assess the efficacy of prevalent MGT detectors in handling mixtext situations, evaluating their performance in terms of effectiveness, robustness, and generalization.

New Definitions to Mixture of MGT and HWT. We defined mixtext, a form of mixed text involving both AI and human-generated content, providing a new perspective for further exploration in related fields.
A Novel Dataset. We proposed a new dataset **MixSet**, which specifically addresses the mixture of MGT and HWT, encompassing a diverse range of operations within real-world scenarios, addressing gaps in previous research.
Comprehensive Experiments and Valuable Insights. Based on MixSet, we conducted extensive experiments involving mainstream detectors and obtained numerous insightful findings, which provide a strong impetus for future research.

Abstract

With the rapid development and widespread application of Large Language Models (LLMs), the use of Machine-Generated Text (MGT) has become increasingly common, bringing with it potential risks, especially in terms of quality and integrity in fields like news, education, and science. Current research mainly focuses on purely MGT detection without adequately addressing mixed scenarios, including AI-revised Human-Written Text (HWT) or human-revised MGT. To tackle this challenge, we define mixtext, a form of mixed text involving both AI and human-generated content. Then, we introduce **MixSet**, the first dataset dedicated to studying these mixtext scenarios. Leveraging **MixSet**, we executed comprehensive experiments to assess the efficacy of prevalent MGT detectors in handling mixtext situations, evaluating their performance in terms of effectiveness, robustness, and generalization. Our findings reveal that existing detectors struggle to identify mixtext, particularly in dealing with subtle modifications and style adaptability. This research underscores the urgent need for more fine-grain detectors tailored for mixtext, offering valuable insights for future research.

MixSet Dataset Construction

In this section, we present **MixSet** (**Mix**case Data**set**), the first dataset featuring a blend of HWT and MGT. Distinguished from earlier datasets exclusively composed of pure HWT and MGT, **MixSet** comprises a total of 3,600 mixtext instances, and the pipeline of its construction is shown in Figure 3. These operations are grounded in real-world application scenarios, each altered by a single LLM or through manual intervention, contributing 300 instances in our **MixSet**.

For our base data, we meticulously select pure HWT and MGT datasets. In the case of HWT, we gather datasets proposed before the widespread use of LLMs to mitigate potential contamination by MGT, as detailed in Table 1. For MGT, we choose samples from previous datasets, generated in a QA pattern by different LLMs, including the GPT family, ChatGLM, BloomZ, Dolly 4 , and StableLM, all distinct from our **MixSet** instances.

Original Sources
Operations
Operation Details
Size
Metric Analysis

The original resources of Human Written Texts in constructing our MixSet.

Combined with previous studies and real scenarios, we use five operations to generate mixtexts. They are divided into two operations: 1) AI-revised: it contains three operations including ‘polish’, ‘complete’, and ‘rewrite’. 2) Human-revised: it includes ‘adapt’ and ‘humanize’.

Polish. Polish operation contains token-level and sentence-level polishing. Token-level makes alterations at the individual word level, including changes such as adjusting words for precision or correcting spelling errors. On the other hand, sentence-level aims to enhance the overall coherence and clarity of the text by revising and restructuring the complete sentence.
Complete. Complete operation involves taking 1/3 of every text and employing LLMs to generate the rest of the text.
Rewrite. Rewrite operation requires LLMs to initially comprehend and extract key information from the given HWT and then rewrite them.
Humanize. Humanize operation typically refers to the modification of MGT to more closely mimic the natural noise for LLM that human writing always brings. We employed LLMs to introduce various perturbations to the pure MGT, including typo, grammatical mistakes, links, and tags.
Adapt. Adapt operation refers to modifying MGT to ensure its alignment to fluency and naturalness to human linguistic habits without introducing any error expression. The adapt operation is also divided into token-level and sentence-level adaptation. We accordingly performed manual annotations on the pure MGT dataset at both the token and sentence levels.

Experiment Setups

Class Number. In real-world scenarios, people often aim to detect the presence of MGT in the text (e.g., spreading fake news or propaganda, reinforcing and intensifying prejudices), and sometimes mixtext is also treated as MGT (e.g., student modified some words in MGT (i.e., mixtext) to generate homework, to avoid detection). Therefore, our experiments established two categorization systems: binary and three-class. In the binary classification, mixtext is categorized as MGT, while in the three-class classification, mixtext is treated as a separate class.

Detectors
Metrics
Train/Test Split

Detector	Q 1	Q 2	Q 3	Q 4
Log-likelihood	✔️	✔️	✖️	✔️
Entropy	✔️	✔️	✖️	✖️
GLTR	✔️	✔️	✖️	✖️
Log-rank	✔️	✔️	✖️	✖️
DetectGPT	✔️	✔️	✔️	✔️
Radar	✔️	✔️	✖️	✖️
ChatGPT Detector	✔️	✔️	✔️	✔️
DistillBert	✔️	✔️	✔️	✖️
GPT-sentinel	✔️	✔️	✖️	✖️
OpenAI Classifier	✔️	✖️	✖️	✖️
Ghostbuster	✔️	✖️	✖️	✖️
GPTZero	✔️	✖️	✖️	✖️

The details of class number, metrics, and whether the detectors are retrained in our experiments. Except for Question 2(b), we implement binary classifications i.e., HWT and MGT. Per. stands for Percentage.

Setting	Q 1	Q 2		Q 3	Q 4
Setting	Q 1	(a)	(b)	Q 3	Q 4
Class Num.	2-Class	2-Class	3-Class	2-Class	2-Class
Metric	MGT Per.	F1, AUC	F1	AUC	F1, AUC
Retrained?	✖️	✔️	✔️	✔️	✔️

Experiment	Data
Q 1	10k	0
Q 2(a)	10k	3k
Q 2(b)	10k	3k
Q 3(Ope.)	1k	0.5k
Q 3(LLM)	5k	1.5k
Q 4	1k/4k/7k/10k	0/1.5k/3k

Question 1. Based on MixSet, we evaluate current detectors to determine the classification preferences on mixtext, i.e., Does the detector tend to classify mixtext as MGT or HWT? We calculate the percentage of mixtext samples categorized to MGT in the experiment. For the DistilBERT detector and other metric-based detectors utilizing logistic regression models, we employ a training set comprising 10,000 pre-processed samples of both pure HWT and MGT. For other detectors, we use existing checkpoints or API and evaluate them in a zero-shot setting.

Question 2(a). Following Question 1, our inquiry is whether the detector can accurately classify mixtext as MGT after training on MixSet. We finetune detectors on pure HWT and MGT data and a train split set of our MixSet labeled as MGT.

Question 2(b). On the other hand, assuming that mixtext lies outside the distribution of HWT and MGT, we conduct a three-class classification task, treating mixtext as a new label. In this scenario, we adopt multi-label training for these detectors while keeping all other settings consistent.

Question 3. Transfer ability is crucial for detectors, our objective is to investigate the effectiveness of transferring across different subsets of MixSet and LLMs. We establish two transfer experiments to assess whether the transferability of current detection methods is closely linked to the training dataset, referred to as operation-generalization and LLM-generalization:

Operation-generalization (3a). We initially train our detectors on one MixSet subset operated by one of these operations, along with pure HWT and MGT datasets, and then proceed to transfer it to the subsets processed by other operations.
LLM-generalization (3b) In this experiment, we train detectors on GPT-generated texts and HWT, following which we evaluate the detectors on mixtext generated by GPT family and Llama2, respectively, to see whether there is a generalization gap between different LLMs.

Question 4. Empirically, incorporating more training data has been shown to enhance detection capabilities and robustness for generalization. To determine the relation between detectors’ performance and the size of the training set, we follow Question 2 and use varying sizes of training sets to retrain detectors.

Empirical Results

Overall (Ex1 & Ex2)
Integrating MixText for Binary
Ex3a
Ex3b
Ex4

F1 score of experiment 2 (a) and (b). Tok. stands for token level and Sen. stands for sentence level. We underscore the best-performing detector and bold the score greater than 0.8, which we consider as a baseline threshold for detection.

Detection Method	Average	AI-Revised								Human-Revised
		Complete		Rewrite		Polish-Tok.		Polish-Sen.		Humanize		Adapt-Sen.	Adapt-Tok.
		Llama2	GPT-4	Llama2	GPT-4	Llama2	GPT-4	Llama2	GPT-4	Llama2	GPT-4	Adapt-Sen.	Adapt-Tok.
Experiment 2 (a): Binary Classification
log-rank	0.615	0.695	0.686	0.637	0.479	0.617	0.606	0.647	0.595	0.617	0.454	0.676	0.667
log likelihood	0.624	0.695	0.695	0.637	0.492	0.657	0.627	0.657	0.657	0.657	0.386	0.676	0.667
GLTR	0.588	0.686	0.647	0.606	0.441	0.574	0.585	0.637	0.540	0.617	0.400	0.657	0.667
DetectGPT	0.635	0.715	0.651	0.656	0.560	0.632	0.587	0.657	0.632	0.692	0.587	0.641	0.609
Entropy	0.648	0.690	0.671	0.681	0.613	0.681	0.671	0.681	0.671	0.623	0.430	0.681	0.681
Openai Classifier	0.209	0.171	0.359	0.031	0.197	0.145	0.270	0.247	0.439	0.247	0.316	0.000	0.090
ChatGPT Detector	0.660	0.705	0.696	0.676	0.583	0.676	0.647	0.647	0.594	0.667	0.615	0.705	0.705
Radar	0.876	0.867	0.877	0.877	0.877	0.877	0.877	0.877	0.877	0.877	0.877	0.877	0.877
GPT-sentinel	0.713	0.714	0.714	0.714	0.714	0.714	0.714	0.714	0.714	0.714	0.696	0.714	0.714
Distillbert	0.664	0.667	0.667	0.667	0.667	0.667	0.667	0.667	0.667	0.667	0.639	0.667	0.667
Experiment 2 (b): Three-class Classification
DetectGPT	0.255	0.276	0.210	0.295	0.278	0.283	0.234	0.271	0.237	0.280	0.222	0.233	0.235
ChatGPT Detector	0.304	0.288	0.346	0.283	0.288	0.395	0.341	0.265	0.328	0.267	0.317	0.253	0.273
Radar	0.775	0.804	0.842	0.797	0.837	0.831	0.820	0.815	0.837	0.884	0.889	0.510	0.429
Distillbert	0.261	0.267	0.333	0.319	0.329	0.294	0.309	0.294	0.329	0.309	0.342	0.000	0.010

Detector	F1	AUC
log-rank	0.830	0.821	0.922	0.922
log likelihood	0.845	0.834	0.931	0.931
GLTR	0.831	0.818	0.920	0.920
DetectGPT	0.746	0.725	0.820	0.820
Entropy	0.770	0.770	0.859	0.859
ChatGPT Det.	0.881	0.896	0.954	0.979
Radar	0.997	0.997	1.000	1.000
GPT-sentinel	0.988	0.982	1.000	0.999
Distillbert	0.996	0.984	1.000	1.000

Method	w.o MixSet	w. MixSet
GPT-sentinel	0.813	0.739	0.972	0.971
Radar	0.834	0.729	0.997	1.000
ChatGPT Det.	0.664	0.445	0.681	0.480
Distillbert	0.687	0.638	0.673	0.698

There is no obvious classification preference in current detectors on mixtext.

In other words, the detectors do not exhibit a strong tendency to classify mixtext as either HWT or MGT. As we can observe from Figure 2 and Table 10, it is evident that the MGT percentage of mixtexts is between MGT and HWT, indicating that the current detectors do not have a strong preference towards mixtext classification. This proves the success and effectiveness of our constructed **MixSet** in presenting mixed features of HWT and MGT, demonstrating the limitations of existing detectors in recognizing mixtext. When dealing with mixtext, the detectors treat it as an intermediate state between HWT and MGT. Most detectors exhibit inconsistent classification within a single subset, fluctuating between accuracies of 0.3 and 0.7, akin to random choice. In AIrevised scenarios, subsets, such as polished tokens or sentences, pose extreme detection challenges. Mainstream detectors generally perform poorly in these cases due to the subtle differences between mixtext and original text. Furthermore, texts generated by Llama2-70b are easier to detect than those by GPT-4, possibly due to GPT-4’s closer generative distribution to human writing.

Binary and Three-class Classification

Supervised binary classification yields profound results; however, three-classes classification encounters significant challenges when applied to mixtext scenarios except Radar. Retrained model-based detectors outperform metric-based methods in both binary and three-class classification tasks. Notably, Radar ranks first in our results, achieving a significant lead over other detectors. We suppose that this superior performance can be attributed to its encoder-decoder architecture, which boasts 7 billion trainable parameters, substantially more than its counterparts. We also examined the impact of retraining on **MixSet** on MGT detection performance. There was a slight decrease in the F1 score, while the AUC metric remained largely unaffected. Notably, post-retraining, the detector acquired the capability to identify mixtext—an advancement deemed highly valuable. This ability to detect mixtext, despite a minor trade-off in F1 score for MGT detection, represents a significant step forward, suggesting a promising direction for enhancing detector versatility and applicability in varied contexts. In the three-class classification task, detectors based on LLMs, particularly the Radar detector, significantly outperformed those utilizing the BERT model. The BERT-based detectors’ performance was markedly poor, akin to random guessing, with some models even underperforming a random baseline. This stark contrast underscores the efficacy of LLMs in capturing nuanced distinctions, as demonstrated in tasks like Mixtext. The superior performance of LLM-based Radar detectors lays a solid foundation for future explorations and applications in fine-grained classification tasks.

Current detectors struggle to generalize across different revised operation subsets of MixSet and generative models.

Significant variability is observed in the transfer capabilities of three different detectors. Additionally, training on texts generated by different revised operations results in different transfer abilities for these detectors. Overall, Radar exhibits the most robust transfer capability among the four model-based detectors, achieving an overall classification accuracy exceeding 0.9, followed by GPT-sentinel, DistillBert, and finally, the ChatGPT Detector. Among various operations, ‘Humanize’ exhibits the poorest transfer performance in almost all scenarios. Additionally, other operations also experience significant declines when dealing with ‘Humanize’ mixtexts. This suggests that ‘Humanize’ falls outside the current detectors’ distribution of MGT, a gap that could be addressed by retraining on these specific cases. It is also noteworthy that texts generated by Llama2-70b demonstrate stronger transfer abilities than those generated by GPT4.

Increasing the number of mixtext samples in the training set effectively enhances the success rate of mixtext detection.

However, adding pure text samples does not yield significant improvements and may even have a negative impact on detector performance, especially for metric-based methods. This may be attributed to subtle distribution shifts between mixtext and pure text. The current detector still faces significant challenges in capturing these subtle shifts. For mixtext scenarios, a more powerful and fine-grained detection method is needed.

BibTeX

@inproceedings{zhang-etal-2024-llm,
        title = "{LLM}-as-a-Coauthor: Can Mixed Human-Written and Machine-Generated Text Be Detected?",
        author = "Zhang, Qihui  and
          Gao, Chujie  and
          Chen, Dongping  and
          Huang, Yue  and
          Huang, Yixin  and
          Sun, Zhenyang  and
          Zhang, Shilin  and
          Li, Weiye  and
          Fu, Zhengyan  and
          Wan, Yao  and
          Sun, Lichao",
        year = "2024",
        url = "https://aclanthology.org/2024.findings-naacl.29"}

LLM-as-a-Coauthor: Can Mixed Human-Written and Machine-Generated Text Be Detected?