Artificial Intelligence as an Evaluator: What Textual Features Predict Success in Research Assessment?

We have all heard the advice on how to write academic papers countless times: “Write simply. Be concise. Avoid unnecessary jargon.” But what if evaluators, and in the future increasingly algorithms too, actually reward the exact opposite? Our team at the Faculty of Informatics and Statistics at the Prague University of Economics and Business decided to replace impressions with data.

In a study published in the journal Machine Learning, we test a method that aims to open the “black box” of artificial intelligence and uncover what distinguishes top-tier papers from average ones.

Teaching AI to Explain Itself (Methodology)

The core idea of our experiment was simple. Most current AI models for text analysis (such as BERT) are opaque: they transform text into complex numerical representations that humans cannot read. We chose a different path. Rather than using large language models (LLMs), specifically Llama 2 and GPT-4, as judges, we used them as intelligent “readers.”

We tasked the models with extracting interpretable features from texts. We asked concrete questions: Is the methodology rigorous? Does the text include statistical analysis? Is the language complex? The answers gave us structured data, which we then used to train transparent classification models. This approach, known as LLM-based feature generation, let us examine which textual features may influence evaluation outcomes. In a second experiment, we went further and asked the models to suggest which features might matter most to evaluators themselves. The results are summarized in the following section.
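As a rough illustration of the extraction step described above, the following Python sketch turns one abstract into structured, one-hot encoded features. It is not the code used in the study: the feature schema, the allowed values, the prompt wording, and the `fake_llm` stand-in (which replaces a real call to a model such as Llama 2 or GPT-4) are all assumptions made for this example. Only the column naming pattern "Feature_Value" follows the labels used in this article.

```python
import json

# Hypothetical feature schema. The feature names echo questions mentioned in
# the article; the allowed values are assumptions made for this sketch.
SCHEMA = {
    "Methodology Rigor": ["Low", "Medium", "High"],
    "Statistical Analysis": ["None", "Basic", "Advanced"],
    "Language Complexity": ["Simple", "Moderate", "Complex"],
}


def build_prompt(abstract: str) -> str:
    """Ask for exactly one allowed value per feature, returned as JSON."""
    lines = [f'- "{name}": one of {values}' for name, values in SCHEMA.items()]
    return (
        "Read the abstract and answer with a JSON object containing:\n"
        + "\n".join(lines)
        + "\n\nAbstract:\n"
        + abstract
    )


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call (assumption); always returns a fixed answer."""
    return json.dumps({
        "Methodology Rigor": "High",
        "Statistical Analysis": "None",
        "Language Complexity": "Complex",
    })


def extract_features(abstract: str, llm=fake_llm) -> dict:
    """Parse the model's reply and keep only values allowed by the schema."""
    answer = json.loads(llm(build_prompt(abstract)))
    return {name: answer[name] for name in SCHEMA if answer.get(name) in SCHEMA[name]}


def one_hot(features: dict) -> dict:
    """Turn categorical answers into 0/1 columns named 'Feature_Value'."""
    return {
        f"{name}_{value}": int(features.get(name) == value)
        for name, values in SCHEMA.items()
        for value in values
    }


row = one_hot(extract_features("We study ..."))
print(row)
```

Each abstract becomes one such row, and the resulting table of 0/1 columns is what a transparent classifier can be trained on.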

A Laboratory Called M17+

Our main testing ground was data from the Czech national evaluation of research organizations conducted under Methodology 17+ (Module 1). We used publicly available abstracts of academic papers that had been assessed by expert panels. The dataset consisted of 2,000 abstracts, evenly balanced across final grades—from top quality to average or below-average results.

We were interested in identifying which textual features were most strongly associated with the final score. The SHAP (SHapley Additive exPlanations) plot below illustrates which features had the greatest impact on the model’s predictions:

Figure 1: What determines success in M17+? The plot shows the influence of individual features on the final score (1 = best, 5 = worst). Each dot represents a single article. Red indicates the presence of a feature, blue its absence. The X-axis shows the shift in evaluation: left means a better score (closer to 1), right a worse one.

A closer look at the data brings some perhaps surprising insights—especially for advocates of simplicity:

The complexity paradox: For the feature Language Complexity_Complex, red dots cluster clearly on the left. Texts that the LLM evaluated as linguistically complex and dense had a higher chance of receiving top scores than those written in a simpler style. In the M17+ environment, an expert tone may function as a signal of quality.

Statistics are a must: The absence of statistical analysis (Statistical Analysis_None, red dots on the right) is associated with worse predicted scores.

Confidence pays off: A high declared research impact (Research Impact_High) pushes the model’s predictions toward better grades.

Disciplinary nuances: The sample also suggests differences across disciplines—for example, biology (Research Discipline_Biology) showed a slight advantage over engineering fields.

Note that the model identifies statistical associations, which are not necessarily causal links.
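To make the idea behind a SHAP plot more concrete, here is a minimal Python sketch that computes exact Shapley values by brute force for a tiny model. Everything in it is an assumption made for illustration: the linear `predict` function with its weights, the three features, and the four background rows are invented and are not the model or data from the study. A feature's Shapley value is its average marginal contribution to the prediction over all subsets of the other features; features outside a subset are filled in from background rows.

```python
from itertools import combinations
from math import factorial

FEATURES = [
    "Language Complexity_Complex",
    "Statistical Analysis_None",
    "Research Impact_High",
]


def predict(x):
    """Toy grade model with invented weights (1 = best, 5 = worst)."""
    return 3.0 - 0.8 * x[0] + 0.6 * x[1] - 0.5 * x[2]


def coalition_value(x, background, subset):
    """Mean prediction when features in `subset` come from x, the rest from background rows."""
    total = 0.0
    for b in background:
        mixed = [x[i] if i in subset else b[i] for i in range(len(x))]
        total += predict(mixed)
    return total / len(background)


def shapley_values(x, background):
    """Exact Shapley values by enumerating every subset of the other features."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Classic Shapley weight for a coalition of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_i = coalition_value(x, background, set(subset) | {i})
                without_i = coalition_value(x, background, set(subset))
                phi[i] += weight * (with_i - without_i)
    return phi


# Invented background sample and one invented article to explain.
background = [[0, 0, 0], [1, 0, 1], [0, 1, 0], [1, 1, 1]]
x = [1, 0, 1]

phi = shapley_values(x, background)
base = sum(predict(b) for b in background) / len(background)
for name, value in zip(FEATURES, phi):
    print(f"{name}: {value:+.2f}")
```

For this toy input the contributions are about -0.40, -0.30 and -0.25. They add up to the gap between the article's prediction (1.7) and the average background prediction (2.65), and a negative value corresponds to a shift to the left in Figure 1, that is, toward a better grade. Brute-force enumeration like this is only practical for a handful of features.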

The Price of Interpretability and General Applicability

One might ask whether this “transparency” comes at the cost of accuracy. We compared our approach with TF-IDF and with SciBERT, an opaque “black-box” model whose individual decisions are difficult to explain. The results were encouraging: our approach performed comparably to SciBERT and often outperformed simpler methods, while making it easier to understand how its results were obtained. This suggests that it is possible to remain competitive without sacrificing explainability.

To test the method’s transferability, we applied it to five very different datasets. In addition to research evaluation and medical texts (CORD-19), we tested it on banking query classification (Banking77), hate-speech detection, and the Food Hazard dataset (reports of dangerous foods). The method adapted without complex retraining: on the food-hazard reports, for instance, the features shifted from methodology or novelty to toxin type, allergens, or contamination severity. This supports the conclusion that semantic feature extraction is applicable across domains.

Conclusion: Simplicity That Opens Doors

The strength of this approach lies in its simplicity: anyone can try it. The process has three steps: take a text, show it to a language model with a few examples so it can propose relevant features (so-called feature discovery), and then let the model fill in these feature values across the dataset. The result is structured data ready for transparent “white-box” methods, such as decision trees. The outcome is not just a prediction, but a deeper understanding of what really matters.
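The last step, training a white-box model on the structured features, can be sketched with the simplest possible decision tree: a single split, often called a decision stump. The Python sketch below is an illustration under stated assumptions, not the study's implementation. The six rows and their labels are invented toy data, not results from the M17+ dataset, and a real analysis would use a full decision-tree library.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    if not labels:
        return 0.0
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(rows, labels):
    """Return the binary feature whose split gives the lowest weighted Gini impurity."""
    best_feature, best_score = None, float("inf")
    for feature in rows[0]:
        present = [y for r, y in zip(rows, labels) if r[feature] == 1]
        absent = [y for r, y in zip(rows, labels) if r[feature] == 0]
        score = (len(present) * gini(present) + len(absent) * gini(absent)) / len(labels)
        if score < best_score:
            best_feature, best_score = feature, score
    return best_feature, best_score


def majority(labels):
    """Most frequent label in the list."""
    return max(set(labels), key=labels.count)


def fit_stump(rows, labels):
    """Fit a one-split tree: pick the best feature, predict the majority label per side."""
    feature, _ = best_split(rows, labels)
    present = [y for r, y in zip(rows, labels) if r[feature] == 1]
    absent = [y for r, y in zip(rows, labels) if r[feature] == 0]
    return {
        "feature": feature,
        "if_present": majority(present or labels),
        "if_absent": majority(absent or labels),
    }


# Invented toy rows in the one-hot 'Feature_Value' format (not study data).
rows = [
    {"Statistical Analysis_None": 0, "Language Complexity_Complex": 1},
    {"Statistical Analysis_None": 0, "Language Complexity_Complex": 1},
    {"Statistical Analysis_None": 0, "Language Complexity_Complex": 0},
    {"Statistical Analysis_None": 1, "Language Complexity_Complex": 0},
    {"Statistical Analysis_None": 1, "Language Complexity_Complex": 1},
    {"Statistical Analysis_None": 1, "Language Complexity_Complex": 0},
]
labels = ["top", "top", "top", "other", "other", "other"]

stump = fit_stump(rows, labels)
print(f"if {stump['feature']} is present: {stump['if_present']}, else: {stump['if_absent']}")
```

The fitted rule can be read directly as a sentence, which is the point of using a white-box model: the reader sees which feature drives the prediction rather than only the prediction itself.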

The full paper is available for a limited time at: https://rdcu.be/eIXYH

Artificial intelligence tools were consulted during the preparation of this popular science article.