Call for Papers: Special Issue on 'Generative AI meets Agent-Based Modelling and Simulation'
 
Credits: N/A
 

Motivation: We are excited to announce a new Special Issue for the MDPI journal Algorithms. We are seeking innovative research that explores the transformative intersection of Generative AI and Agent-Based Modelling and Simulation (ABMS). As Large Language Models (LLMs) and generative technologies continue to evolve, they offer unprecedented opportunities to enhance the entire ABMS workflow, from conceptualisation and code generation to stakeholder reporting. However, this integration also brings critical challenges regarding transparency, validation, and ethics that the research community must address.

 
Algorithm special issue banner
 

Scope:

  • We welcome theoretical, methodological and empirical contributions across diverse fields, including Computer Science, Operations Research, Computational Social Science, Economics, Public Health, and more.

Topics:

  • Theoretical and methodological advances in integrating generative AI with ABMS.
  • LLMs as cognitive components for human-like reasoning in agents.
  • Applications of generative AI across the ABMS cycle.
  • Critical evaluations regarding bias, verification, and validation.
  • Cross-disciplinary applications and responsible AI development.

Guest Editors:

  • Prof. Ender Özcan (University of Nottingham, UK)
  • Dr. Peer-Olaf Siebers (University of Nottingham, UK)

Submission Details:

  • Journal: Algorithms (MDPI)
  • Submission Deadline: 30 June 2026

More Information: Read more and submit your work here.

Keywords: #GenerativeAI #AgentBasedModelling #ABM #Algorithms #LargeLanguageModel #LLM #Simulation #ComputationalSocialScience #DecisionSupport #SpecialIssue #ResponsibleAI #Research

 
 
Back to Top
 
 
LLM4ABM Special Interest Group (SIG) Established as an ESSA SIG
 
Credits: Peer-Olaf Siebers
 

As a recent addition to the ESSA (European Social Simulation Association) Special Interest Group (SIG) portfolio, the LLM4ABM (Large Language Models for Agent-Based Modelling) SIG explores the latest trends in applying LLMs within the context of Agent-Based Modelling (ABM).

Our Mission:

  • We are an active community, founded in the summer of 2024. Our core focus is the role of LLMs in ABM across the full modelling cycle. We examine how LLMs can support every stage from conceptual design to implementation, validation, and communication of results. We also discuss the potential for LLMs to transform qualitative knowledge into explicit behavioural rules, decision processes, and other artefacts that enable the development and execution of agent-based models.

What we do:

  • Host monthly online interactive sessions
  • Explore cutting-edge topics shaping ABM and AI
  • Investigate applications beyond Computational Social Science
  • Bring together diverse perspectives from computer science, sociology, philosophy, economics, and more

Join Us! Everyone is welcome - newcomers and experts alike.

Organiser: Peer-Olaf Siebers (University of Nottingham, UK)

If you would like to join, email me at peer-olaf.siebers@nottingham.ac.uk

 
 
Back to Top
 
 
LLM Support for Creating RAT-RS Reports (2/2): Quality & Reliability
 
Credits: Conceptualisation, experimentation and drafting by Peer-Olaf Siebers. Copy editing and initial output analysis by Claude Sonnet 4.5
 
Motivation
 

In my previous blog post "LLM Support for Creating RAT-RS Reports" I wrote about the potential of using LLMs for generating draft RAT-RS reports. In short, the RAT-RS is a reporting standard intended to improve documentation of data use in Agent-Based Modelling (ABM). It supports reporting across several data applications (specification, calibration and validation) and is compatible with a variety of data types, including statistical, qualitative, ethnographic and experimental data, consistent with mixed methods research (Achter et al 2022). Structured as a set of Question Suites (QSs), the RAT-RS addresses distinct phases of data use throughout the ABM lifecycle.

For this blog post I wanted to find out more about the quality of these LLM-generated draft RAT-RS reports. To guide this exploration, I formulated two distinct hypotheses based on the distinction between objective and subjective content:

  • Hypothesis 1 (Alignment): LLMs align closely with human experts when answering objective questions but diverge significantly when answering subjective questions.

  • Hypothesis 2 (Consistency): LLMs provide highly consistent responses for objective questions but exhibit poor consistency (reproducibility) for subjective questions.

Besides testing these hypotheses I was also interested in exploring the response similarity of different LLMs in comparison to the human-authored RAT-RS report, as well as the similarity between responses from the same LLM when answering objective and subjective questions.

I opted to bypass a comprehensive review of the extensive literature on human versus LLM question answering and quality measurement. While plenty of academic studies exist for those wishing to dive deeper, my primary motivation was to explore these dynamics through direct, hands-on experimentation.

A spreadsheet containing all figures in this blog post, as well as the source code for the semantic analysis tool, is available for download.

 
Question Objectivity Classification
 

I started by analysing the "Question Objectivity" of the RAT-RS questions. In our case question objectivity refers to the degree to which a question has a single, verifiable answer that can be directly extracted from source material without requiring interpretation, inference, or evaluative judgement. The objectivity classification provides a foundation for interpreting the subsequent experiments. It establishes a framework for understanding whether response patterns correlate with question characteristics.

  • Objective questions seek factual information that can be directly extracted from the paper without interpretation. These include citations, data types, experiment types, and parameter values. The key indicator is whether the answer can be directly quoted from the source material, making them straightforward to verify and answer with minimal ambiguity or need for judgement.

  • Moderately subjective questions require some interpretation beyond direct extraction. They typically involve categorising purposes, identifying "key" elements, or explaining "why" alongside factual "what" information. These questions may require selecting between documented options or explaining rationales for documented actions, sitting between purely factual extraction and evaluative judgement.

  • Highly subjective questions demand substantial interpretation and inference. They require understanding authors' unstated rationale and making explicit evaluative judgements. These questions explicitly request ratings, recommendations, or assessments of undocumented decisions. Unlike objective queries, they cannot be answered through direct quotation and instead rely heavily on the respondent's analytical perspective.

I asked Claude to do the job for me: to classify all RAT-RS questions and, to keep it simple, to group them into three categories: objective questions (objectivity level 7-9; green in the below table), moderately subjective questions (objectivity level 4-6; yellow in the below table), and highly subjective questions (objectivity level 1-3; red in the below table). I then checked the classification and found that the criteria had been applied well and that the explanations provided plausible justifications for the decisions. The result is shown below.

 
objectivity classification table - part 1
Click images to enlarge
objectivity classification table - part 2
Click images to enlarge
objectivity classification table - part 3
Click images to enlarge
 

The analysis shows that QS 1 to QS 5 contain a mixture of objective fact-extraction and moderate interpretation questions, whereas QS 6 consists of a set of highly subjective questions that require assessors to apply expert judgement to evaluate quality and provide recommendations. QS 6 was added solely to test how LLMs handle the type of questions contained in this QS. It is not part of the original RAT-RS and is therefore not used when judging the quality of automatically generated RAT-RS reports.

 
Similarity Assessment via Semantic Analysis
 

The next step in this project was to look at similarity assessment via semantic analysis. While my initial quality assessment was based on (my own) expert opinion, this time I wanted to use a more systematic approach. For the similarity assessment I again used the paper by Siebers & Aickelin (2011) that featured in the previous blog post.

Semantic analysis via sentence similarity provides a quantitative method for assessing the semantic equivalence and content relatedness between textual units. The primary technique employs sentence embedding models (e.g., SBERT, Universal Sentence Encoder) that encode a sentence into a dense, high-dimensional vector representation, which mathematically encapsulates its semantic meaning. The similarity between two sentences is then computed by measuring the geometric distance between their corresponding vectors, most commonly using cosine similarity. A cosine score approaching 1.0 indicates high semantic congruence.
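To make this concrete, here is a minimal sketch of such a sentence-level comparison. It assumes the sentence-transformers package and the general-purpose all-MiniLM-L6-v2 SBERT model (an illustrative choice, not necessarily the model used in my tool), and the two example answers are made up for demonstration:

from sentence_transformers import SentenceTransformer, util

# Compare two answers to the same RAT-RS question (illustrative example)
model = SentenceTransformer("all-MiniLM-L6-v2")

human_answer = "Triangular distributions for service times; look-up tables for staff rotas."
llm_answer = "Durations are implemented as triangular distributions, rotas as look-up tables."

embeddings = model.encode([human_answer, llm_answer])
score = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"Cosine similarity: {score:.3f}")  # values approaching 1.0 indicate high semantic congruence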

For the assessment, I used a custom semantic analysis tool I developed with Gemini's assistance. The tool leverages the SBERT sentence embedding model to enable several analytical capabilities: comparing two individual RAT-RS reports, calculating individual and average quality statistics, and facilitating human question-by-question comparison through a side-by-side response display. Importantly, the tool supports both inter-response analysis (comparing reports across different LLMs) and intra-response analysis (comparing multiple reports generated by the same LLM).

I conducted both inter-response and intra-response semantic similarity comparisons. This approach allowed me to evaluate both the alignment between different sources and the inherent consistency of each model. In the inter-response evaluation, I compared the semantic similarity between RAT-RS reports from different sources, including a human-authored RAT-RS report and reports generated by four distinct LLMs. In the intra-response evaluation, I assessed the consistency of each of the four LLMs by comparing the semantic similarity across three of its own responses to the same prompt. For both, I examined the association between question objectivity and response similarity.

 
Inter-Response Semantic Similarity Comparison
 

For the inter-response semantic similarity comparison I copied the human-authored RAT-RS report and the responses of the four LLMs included in the analysis into the same spreadsheet. As the human-authored RAT-RS report was lacking QS 6, I ignored this QS as well for this analysis.

The primary metric used for assessing the similarity between different interpretations of the same document is cosine similarity, which measures the angle between two text vectors in high-dimensional space. With this approach we capture semantic relatedness rather than exact word matching. Cosine similarity can in principle range from -1 to 1, but for the texts compared here the scores effectively fall between 0 and 1.

A rule of thumb for assessing the similarity between texts is the following:

  • 0.90 to 1.00: Near-perfect semantic match
    • Texts convey essentially the same meaning; minor differences in phrasing or detail; high confidence in agreement

  • 0.75 to 0.90: Strong agreement
    • Core concepts align well; some differences in emphasis or detail; generally reliable interpretation match

  • 0.50 to 0.75: Moderate agreement
    • Partial overlap in concepts; different perspectives on the same topic; may require manual review

  • 0.30 to 0.50: Weak agreement
    • Limited conceptual overlap; different aspects or interpretations; significant discrepancies likely

  • 0.00 to 0.30: Low/no agreement
    • Minimal semantic relationship; possibly discussing different topics; requires careful examination

In the experiment I calculated the similarity between the human-authored and LLM-generated RAT-RS reports. As we do not know which of the reports provides the best-quality answers, I collected data treating each report in turn as the ground truth against which the other reports are compared. The results can be found below.

To keep it simple I reduced the number of categories to three: strong agreement (similarity score 0.75-1.00; green in the below table), moderate agreement (similarity score 0.50-0.75; yellow in the below table) and weak agreement (similarity score 0.00-0.50; red in the below table).
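To illustrate how such report-level figures and categories can be produced, here is a minimal sketch that averages per-question cosine similarities for each pair of reports and bands the result into the three agreement categories above. It assumes each report is held as a list of answer strings aligned by question; the model choice and the two short example reports are placeholders rather than the actual data:

import itertools
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

def report_similarity(report_a, report_b):
    # Average per-question cosine similarity between two aligned reports
    emb_a = model.encode(report_a)
    emb_b = model.encode(report_b)
    scores = [util.cos_sim(a, b).item() for a, b in zip(emb_a, emb_b)]
    return sum(scores) / len(scores)

def agreement_band(score):
    # Three-category reduction used in the table below
    if score >= 0.75:
        return "strong agreement"
    if score >= 0.50:
        return "moderate agreement"
    return "weak agreement"

# Placeholder reports: one answer string per question, aligned across sources
reports = {
    "Human": ["Triangular distributions and look-up tables.",
              "Calibrated against weekly transaction numbers."],
    "Claude": ["Durations use triangular distributions; rotas are look-up tables.",
               "Proactivity parameters tuned to weekly transaction counts."],
}

for (name_a, rep_a), (name_b, rep_b) in itertools.combinations(reports.items(), 2):
    score = report_similarity(rep_a, rep_b)
    print(f"{name_a} vs {name_b}: {score:.3f} ({agreement_band(score)})")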

 
inter-response similarity comparison table
Click images to enlarge
 

The human-LLM similarity comparison reveals modest semantic alignment between the human-authored RAT-RS report and LLM-generated versions, with all models achieving similarity scores between 0.436 and 0.453, indicating weak to moderate agreement according to the established interpretation framework. This suggests substantial interpretive divergence in how different sources address the same research paper, potentially reflecting fundamental differences between human expert knowledge synthesis and algorithmic text processing approaches. The relatively low scores imply that LLMs and human experts prioritise different textual elements or employ distinct interpretative strategies when extracting information from academic literature. It is also interesting to see that some questions that were previously classified as highly objective (e.g. Q3.2 and Q3.7) perform very poorly in the human-LLM similarity comparison, showing only weak agreement.

In contrast, the inter-LLM similarity comparisons show considerably stronger semantic convergence, with pairwise similarities ranging from 0.600 to 0.673, demonstrating moderate to strong agreement. DeepSeek and Claude exhibited the highest inter-model similarity (0.673), whilst Gemini and ChatGPT exhibited the lowest inter-model similarity (0.600). These patterns suggest that whilst LLMs converge towards similar interpretative frameworks distinct from human expert analysis, individual model architectures, training methodologies, and underlying datasets still produce measurably different semantic outputs when processing identical source material. This clustering effect, where LLMs resemble each other more than the human baseline, raises important questions about whether automated RAT-RS report generation captures the nuanced understanding that domain experts bring to documentation tasks, or whether it simply reflects common training data biases across contemporary language models. However, it is also plausible that lower similarity might reflect the human extraction capturing fewer details rather than interpretative divergence. If the human-authored RAT-RS report provided more succinct answers whilst the LLM-generated RAT-RS reports provided more elaborate responses with additional contextual information, this would mathematically decrease similarity scores despite both being "correct" interpretations. That said, expert responses typically exhibit higher information density rather than lower detail levels. Domain experts often include tacit knowledge and nuanced interpretations that LLMs might miss, suggesting the low similarity might genuinely reflect interpretative differences rather than brevity alone.

Returning to Hypothesis 1, that LLMs align with humans on objective questions but diverge on subjective ones, the averages on the right of the table suggest a correlation between objectivity and similarity of responses. Most questions classified as subjective score low in average similarity. This supports the hypothesis that while LLMs can retrieve factual data somewhat reliably, their interpretative strategies for subjective queries differ fundamentally from those of human experts. The next step would be an in-depth, question-by-question comparison and expert judgement, ideally by the author of the paper under investigation. This would show which of the above hypotheses are correct, and whether the automated extraction works better than the semantic analysis suggests.

 
Intra-Response Semantic Similarity Comparison
 

For the intra-response semantic similarity comparison I copied three RAT-RS reports per LLM into a single spreadsheet, one spreadsheet for each of the four LLMs included in the analysis. In the analysis I calculated the similarity between the LLM-generated RAT-RS report replications for all four LLMs. The results can be found below. Of course, intra-response similarity relies heavily on LLM settings such as temperature and top_p, which affect output determinism. Unfortunately, no reliable information about the default settings is available from the LLM providers. My best guess is that the LLMs use similar default temperature and top_p values in their public-facing interfaces. One could run the experiments under controlled experimental conditions via API access, but I leave this for the future.
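For illustration, here is a minimal sketch of what such a controlled set-up could look like, assuming the OpenAI Python client as one example of API access; the model name and prompt are placeholders, and temperature and top_p are pinned explicitly rather than left to interface defaults:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_report_replications(prompt, n=3, temperature=0.7, top_p=1.0):
    # Send the same prompt n times under fixed sampling settings
    replications = []
    for _ in range(n):
        completion = client.chat.completions.create(
            model="gpt-4o",  # placeholder model name
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
            top_p=top_p,
        )
        replications.append(completion.choices[0].message.content)
    return replications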

 
intra-response similarity comparison table
Click images to enlarge
intra-response similarity comparison table
Click images to enlarge
 

There is no systematic pattern emerging from the intra-response similarity analysis. However, two observations stand out. First, the inconsistency associated with question subjectivity is dispersed across the entire set of questions rather than concentrated in specific questions. This is unexpected. If subjectivity were the dominant driver, one would expect the same (more subjective) questions to exhibit higher inconsistency across all models. Instead, the locations of lower similarity differ by LLM, indicating that response instability is model-specific rather than question-specific. Second, Gemini exhibits substantially higher intra-LLM inconsistencies than any of the other models, most notably for the highly subjective QS 6. This behaviour is not mirrored by DeepSeek, ChatGPT, or Claude, which remain comparatively stable even on highly subjective questions.

The average intra-response similarity scores further reinforce these differences. Gemini records the lowest mean similarity (0.671), whereas the remaining three models cluster closely together (DeepSeek: 0.731; ChatGPT: 0.725; Claude: 0.740). Claude demonstrates the highest overall consistency. Notably, all averages except Gemini lie just below 0.75, which marks the transition to strong semantic similarity. This suggests that, under default interface settings, most models produce reasonably stable responses to repeated prompts, with Gemini as a clear outlier.

Returning to Hypothesis 2 that subjective questions would result in poor consistency, the intra-LLM analysis provides no evidence of a relationship between question objectivity and response similarity. Subjective questions are not systematically associated with lower intra-model consistency. Instead, response stability appears to be driven primarily by model-specific generation behaviour rather than by the epistemic characteristics of the questions themselves. For example, Gemini exhibits substantially higher intra-LLM inconsistencies than any of the other models, most notably for the highly subjective QS 6, whereas Claude demonstrates high stability across both objective and subjective types. This fails to support Hypothesis 2; consistency is a feature of the model architecture and settings, not necessarily the subjectivity of the prompt.

 
Conclusion
 

This study examined the quality and consistency of human-authored and LLM-generated RAT-RS reports, with particular attention to the relationship between question objectivity and response similarity. The primary takeaways from this investigation are:

  • First, semantic similarity analysis and expert judgement do not necessarily coincide. Responses that appear adequate from an expert perspective often achieve only moderate similarity scores. This reflects a core limitation of similarity-based metrics: they measure semantic overlap rather than correctness or methodological quality. Semantic analysis is therefore informative but insufficient as a standalone quality assessment.

  • Second, the inter-response results indicate that LLMs converge towards a shared interpretative space that differs from human expert analysis. LLM-LLM similarity is consistently higher than human-LLM similarity, confirming that LLMs prioritise different textual elements than humans, particularly in subjective contexts. Whether this convergence represents useful standardisation or a loss of expert nuance remains unresolved.

  • Third, the intra-response analysis does not support the expectation that subjective questions reduce response consistency within the same LLM. Instead, consistency is largely model-dependent. Some models remain stable even on subjective questions, while others exhibit marked inconsistency, shifting the focus from question design to model-specific reproducibility.

Overall, LLMs can support the generation of RAT-RS reports but cannot replace expert judgement. They produce coherent and often detailed outputs, yet their interpretations differ systematically from human experts and vary across models. Used with appropriate oversight, they are a valuable assistive tool, not an authoritative one.

 
References
 
  • Achter, S., Borit, M., Chattoe-Brown, E., & Siebers, P. O. (2022). RAT-RS: A reporting standard for improving the documentation of data use in agent-based modelling. International Journal of Social Research Methodology. https://doi.org/10.1080/13645579.2022.2049511

  • Siebers, P. O., & Aickelin, U. (2011). A first approach on modelling staff proactiveness in retail simulation models. Journal of Artificial Societies and Social Simulation, 14(2), 2. https://www.jasss.org/14/2/2.html
 
Appendix
 

This appendix presents additional exploratory experiments conducted during the study. While these do not directly address the stated hypotheses, they provide useful context and insights into the behaviour of the models and the broader experimental setup.

 
Question Order Bias
 

After creating many reports I observed a consistent pattern across different LLMs and papers: Question 6.5 (limitations articulation) consistently scored higher than Question 6.2 (overall data reporting quality). I asked Claude for an explanation, and Claude came up with some suggestions that provide food for thought.

  • Asymmetric Difficulty: Articulating limitations is easier than comprehensively documenting all data processes. Authors can acknowledge gaps ("we didn't collect X") without having to fill them. Complete data documentation is harder - it requires proactive, detailed reporting across specification, calibration, validation, sources, processing, etc. Papers often have explicit "limitations" sections, making them easy to identify and credit, whereas data documentation is scattered throughout methods sections.

  • Academic Writing Conventions: Modern academic standards require limitations sections (reviewers expect them). Data documentation standards are still emerging - RAT-RS itself exists because this wasn't traditionally required. Authors are trained to write limitations; comprehensive data provenance reporting is a newer expectation.

  • Assessment Scope Difference: Question 6.2 asks about transparency, completeness, AND methodological adequacy across the entire data workflow. Question 6.5 focuses narrowly on whether limitations are clearly described and justified. It's easier to score well on a narrower criterion.

So, the suggestion is that the pattern indicates paper authors are better at acknowledging gaps than preventing them, which is exactly what RAT-RS exists to address! To test these explanations, Claude suggested swapping the two questions, asking Question 6.5 before Question 6.2, to see if the pattern persists. And in fact, swapping the questions broke the pattern.

This is a practical demonstration of the principles of complexity theory: the scoring pattern emerged from the interaction between question sequence and the LLM's contextual reasoning. A small intervention altered this network of influences, shifting the outcome. While it is exciting to see complexity theory in action, it is equally concerning. It forces me to consider how many other latent structural biases may be hidden within the RAT-RS framework. It's a critical insight for any research methodology employing sequential or interdependent evaluation criteria.

 
Testing Claude's Patience
 

I wanted to see what happens if I feed in unrelated literature. I chose a "Lorem Ipsum" paper and "'I Shall Sing of Herakles': Writing a Hercules Oratorio for the Twenty-First Century", written by Emma Stafford and Tim Benjamin.

Testing Claude using the Lorem Ipsum paper showed that Claude correctly identified the paper as placeholder text rather than a genuine research article. The quality rating was 1/10 for both Question 6.2 and Question 6.5. At the end, Claude sounded a little grumpy, providing the following conclusion:

    "This document is a 'Lorem Ipsum' paper - placeholder text formatted as a scientific article with no actual research content. It cannot be legitimately assessed against RAT-RS standards for Agent-Based Modelling data documentation. All responses marked 'NOT AVAILABLE' reflect genuine absence of information, whilst responses with extracted content are taken from meaningless placeholder text. This paper requires complete replacement with legitimate scientific research before RAT-RS assessment becomes meaningful."

Testing Claude using the oratorio paper revealed that Claude immediately identified the absence of any model in the paper and repeated this observation in response to every question. The quality ratings, however, were 5/10 for Question 6.2 and 8/10 for Question 6.5. The comments for these stated that the scores assessed the humanities content of the paper, as the paper had nothing to do with ABM. This time Claude was less grumpy and provided the following conclusion:

    "CRITICAL NOTE: This document is fundamentally mismatched with the RAT-RS framework. The RAT-RS (Rigor and Transparency Reporting Standard) is designed for agent-based models and social simulations that use empirical data for computational modelling. This article documents the creation and performance of a musical oratorio (an artistic work) drawing on classical texts. It belongs to humanities scholarship (classical reception studies, musicology, practice-based research) rather than computational social science. While I have attempted to address the template questions, most are genuinely 'NOT APPLICABLE' because no computational model exists in this work."

 
 
Back to Top
 
 
LLM Support for Creating RAT-RS Reports (1/2): Feasibility
 
Credits: Written by Peer-Olaf Siebers. Copy-edited by ChatGPT 5
 
Motivation
 

When we report research outcomes, we should provide clear, accessible information about our data: what data we used, why we used it, and how we used it. This strengthens readers' understanding and supports reproducibility. In practice, however, those details are scattered throughout a publication and are rarely easy to find.

Although the Agent-Based Modelling (ABM) community has a reporting standard, the Rigor and Transparency Reporting Standard (RAT-RS), uptake has been limited. Colleagues and I agree that the main barrier is effort. Producing a RAT-RS report feels time-consuming, especially for smaller projects, and there is a sense that others do not bother either; a classic "Tragedy of the Commons".

My goal was to test whether Large Language Models (LLMs) can do the heavy lifting of extracting the required information from publications so that humans need only validate and refine the output. Large language models are good at text extraction and summarisation, so they are a natural fit for this task. I wanted to evaluate their reliability in assisting this workflow.

 
What is the RAT-RS
 

The RAT-RS is a reporting standard that emerged from a Lorentz Center Workshop on Integrating Qualitative and Quantitative Evidence using Social Simulation (Leiden, the Netherlands, April 2019) and was refined across subsequent workshops, until it was finally published in Achter et al (2022).

The RAT-RS is intended to improve documentation of data use in ABM. It supports reporting across several data applications (specification, calibration and validation) and is compatible with a variety of data types, including statistical, qualitative, ethnographic and experimental data, consistent with mixed methods research.

The RAT-RS is organised as a Question Suite (QS) toolbox. Each QS focuses on a distinct aspect of data use. There are multiple RAT-RS "flavours" with distinct Conceptualisation QSs. The first step in applying the RAT-RS is to identify the main driver of model development. The RAT-RS supports four approaches: theory-driven models (focusing on pre-existing theories), OR-data-driven models (focusing on key mechanisms), another-model-driven models (focusing on pre-existing models), and participatory-driven models (focusing on participatory design processes).

 
qs-overview.jpg
Click image to enlarge
 

As an example, here are the questions from the Operationalisation QS:

    Q3.1: What data element(s) did you include for implementing each key model element in the model's scope?
    Q3.2: Are these data elements implemented with the help of qualitative or quantitative data or further models?
    Q3.3: Explain how data affected the way you implemented each model element and why. {i.e. explain your choice of data elements}
    Q3.4: What are the data elements used for in the modelling process: specification, calibration, validation, other?
    Q3.5: Why for this use and not another one?
    Q3.6: Did required data exist?
    Q3.7: If it existed, did you use it?
    Q3.8: If you did not use it, why not?
    Q3.9: For the existing data you used, provide details (a description) about data sources, sampling strategy, sample size, and collection period. For the data you collected, provide details about how it was collected, sampling strategy, sample size, and collection period.
    Q3.10: Justify your data gathering decisions from 3.9.
    Q3.11: If you needed to analyse the data before including them in the model (regardless of if you collected data yourself or you used existing data), what data analysis did you do and why did you choose this specific analysis?
    Q3.12: In what format was the data implemented? {e.g. look-up table; distribution}
    Q3.13: Why this way?

For full details see Achter et al (2022).

 
The Magic Prompt
 

I used an exploratory approach for crafting the prompt needed to extract relevant RAT-RS information from publications. Surprisingly, I found that the information extraction works better without providing a reference to the RAT-RS publication. Instead, a concise summary of the RAT-RS and its terminology improved responses. Including the terminology definition in the prompt made a notable difference.

The prompt itself consists of separate components to provide clear guidance to the LLM on what is expected from it in terms of response delivery.

The prompt is modular and contains the following components:

  • Prompt: Describes the role of the LLM and gives a brief summary of the RAT-RS
  • Instructions: Step-by-step guidance on how to complete the RAT-RS questions
  • Output Format: Specifies how the extracted information should be presented
  • Terminology: Defines terms used in the RAT-RS QSs to reduce ambiguity
  • Template: The RAT-RS QSs for a specific flavour

A minimal sketch of how these components can be assembled into a single prompt is shown below.
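The sketch assumes each component is kept in its own text file; the file names are hypothetical and do not correspond to the structure of the GitHub repository mentioned below:

from pathlib import Path

# Hypothetical component files, one per building block listed above
COMPONENT_FILES = [
    "prompt.txt",                   # role of the LLM and brief RAT-RS summary
    "instructions.txt",             # step-by-step guidance for completing the QSs
    "output_format.txt",            # how the extracted information should be presented
    "terminology.txt",              # definitions of RAT-RS terms to reduce ambiguity
    "template_theory_driven.txt",   # the QSs for one specific RAT-RS flavour
]

def build_prompt(component_dir="rat-rs-prompt"):
    # Concatenate the modular components into a single prompt string
    parts = [Path(component_dir, name).read_text(encoding="utf-8") for name in COMPONENT_FILES]
    return "\n\n".join(parts)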

 
magic-prompt.png
Click image to enlarge
 

Two elements I experimented with were a request to include direct quotes with page or section references to aid validation (Instruction component), and a strict word limit for each response to prevent verbosity. Claude, for example, became very verbose without a word limit (Output Format component). I recommend testing these options and adjusting them to suit your specific needs.

I also created a supplementary QS that asks the LLM to assess reporting quality and to make recommendations to the authors for improving data use, sourcing and documentation. Initially this was an informal experiment to see how the LLM evaluated the authors' efforts. However, the responses were sufficiently insightful that I incorporated it as QS6 in the RAT-RS.

The full prompt template for each individual RAT-RS flavour is available on GitHub.

 
To download the source code from GitHub, click the "Code" link above. This will take you to the correct section of the GitHub repository. In the window that opens, click the downward-facing arrow in the top-right corner to download the zip folder containing the prompts and examples. The folder will be saved to your default download location.
 

To use this novel LLM-supported method for generating RAT-RS reports, choose the RAT-RS flavour most relevant for the publication, select an LLM (ideally using an advanced mode), upload the publication, copy and paste the prompt template, and run the query. The LLM will produce a draft RAT-RS report that you can validate and refine. It's as simple as that!

 
First Experience with Using LLMs to Generate RAT-RS Reports
 

I tested the approach on the two worked examples included with the RAT-RS publication. These were useful because manually extracted reports from the original publication provide ground truth. The two examples were:

  • Siebers & Aickelin (2011) - modelling staff proactiveness in retail simulation models
  • Chattoe-Brown (2014) - using agent-based modelling to integrate data on attitude change

I tested four LLMs, using their advanced modes when they were free to use: Gemini (Thinking Mode), DeepSeek (DeepThink), ChatGPT (Think), and Claude (Standard). The LLMs showed clear differences in style, level of detail and response format. Two illustrative comparisons follow.

The first example demonstrates variation in the level of detail and the tendency of some LLMs to omit quotations and page references, even though the prompt requested them.

Q3.12: In what format was the data implemented? {e.g. look-up table; distribution}

    Manual: Triangular distributions (service time), look-up tables (staff rota, archetypes), parameters (inter look around time, pool sizes), statecharts (states), sets of rules (proactive strategy), probabilities (decisions), exponential distribution (arrival rates), boolean variables (staff proactive), type variables (queue type), standard FIFO queues (queuing behaviour), basic network (WOM).

    Gemini: Triangular Distributions (defined by minimum, mode, and maximum) and Probabilities (decimal values).

    DeepSeek: Look-up tables (e.g., customer type attributes), Probability distributions (for decisions), Frequency distributions (Triangular for event durations), Parameters (for proactivity rules). (Page 10, Table 1, 2, 3; Page 12-13, Proactivity parameters).

    ChatGPT: Implemented mainly as triangular frequency distributions for durations, probability distributions for decisions (thresholds), parameter values (proactivity P1–P6), and lookup / configuration parameters for staffing and customer mixes. Model implemented in AnyLogic (Java).

    Claude: Triangular distributions: For time-based events (browsing, service times, patience thresholds). "Our frequency distributions are modelled as triangular distributions" (p.10). Example in Table 1: "Leave browse state after...1, 7, 15 [minutes]". Probability values: For decision-making. Example in Table 2: "Someone makes a purchase after browsing: 0.37" (p.10). Lookup parameters: Customer type definitions as likelihood levels (low/moderate/high) in Table 3. Numerical parameters: Staff numbers, queue thresholds, costs. Algorithms: "Two algorithms to imitate the influence of the customer type attributes" implemented as methods (p.11), with pseudo-code in Figure 4. State charts: Agent behaviours implemented as state charts (Figures 1-3). Arrays/Collections: Customer pool and staff pool as collections of agents.

The second example illustrates a case in which my manual extraction was somewhat cursory, whereas all of the LLMs produced far more detailed and thorough outputs.

Q4.1: Describe the calibration process you followed, stating which parameters you calibrated, their ranges, your reasons, and the similarity you achieved.

    Manual: Calibrated our "proactiveness" parameter to match real system performance in terms of "number of transactions".

    Gemini: Calibrated "proactivity parameters" (Experiment 2). Parameters: "Critical queue length" (tested 1 to 4) and "Maximum number of customers to serve as a temporary cashier" (tested 2 to 10). Similarity: Critical queue length of ~1.25 for A&TV and ~2.41 for WW achieved a good match to real transactions.

    DeepSeek: Calibrated proactivity parameters: "critical queue length" (tested 1-4), "maximum number of customers to serve as a temporary cashier" (tested 2-10), "stop strategy" (tested 3 types). Reason: To match real system weekly transactions (1787.09 for A&TV). Similarity: Achieved closest match with critical queue length ~1.25 for A&TV. (Page 19-20, "Experiment 2").

    ChatGPT: Calibration approach: iterative; authors tuned proactivity parameters and staffing configurations to reduce macro deviation against real transaction data. Key calibrated / sensitivity parameters: P1–P6 (defaults listed in paper: e.g. P1=10 customers, P2 open/close thresholds 3/2, P3=2, P4 max tills 4/6, P5 stop strategy P1 or P2, P6 check every 2 minutes). Experiments varied these (see Tables 9–10). Similarity achieved: proactivity on with real rotas reduced deviation substantially (e.g. WW scenario d deviation 6.15% vs 36.80% without proactivity). For A&TV similarity improved but some gap remained (A&TV scenario d deviation 24.67%). Exact numeric ranges tested for each parameter are reported in experiment tables in the paper; the authors state the critical queue length had largest impact.

    Claude: Parameters calibrated: Six proactivity parameters (P1-P6). Specifically: "P1: Maximum number of customers to serve as a temporary cashier" (tested from 2 to 10); "P2: Critical queue length for opening/closing additional tills" (tested from 0 to 4); "P5: Stop strategy" (tested 3 different strategies). Ranges: P1 tested in steps of 2 from 2 to 10; P2 tested in steps of 1 from 1 to 4 (plus extreme case of 0); three stop strategies compared. Reasons: Initial estimates were "the same for both departments" and needed refinement to "match the average weekly number of transactions that we observed during our case study period" of 1787.09 for A&TV and 3172.35 for WW (p.19). Similarity achieved: For A&TV, "critical queue length" of approximately 1.25 matched transaction numbers. For WW, approximately 2.41. However, full calibration not completed: "More tests and observations are needed for developing a more reliable set of proactivity rules and parameter values" (p.21).

Overall, the LLMs rarely produced completely incorrect answers. They often extracted useful details that I had missed during manual extraction, particularly when information was scattered through the paper. It is worth noting that one of Claude’s strengths was its ability to provide accurate quotations and page numbers, which makes validation straightforward. The trade-off is verbosity. More verbose outputs are less immediately clear.

A full set of responses for all LLMs, in the form of two spreadsheets, is available on GitHub.

 
To download the source code from GitHub, click the "Code" link above. This will take you to the correct section of the GitHub repository. In the window that opens, click the downward-facing arrow in the top-right corner to download the zip folder containing the prompts and examples. The folder will be saved to your default download location.
 
Use Cases for LLM-Generated RAT-RS Reports
 

LLM-generated RAT-RS reports have several practical uses. They are not limited to publications and can also be applied to funding proposals, project notes and other documents.

  • For authors, they can provide a first draft of data documentation, help validate data use and identify reporting gaps, support calibration by prompting follow-up questions after a draft RAT-RS report is produced, and act as a pre-submission check to ensure that required data details are included.
  • For readers, these LLM-generated reports can offer a quick summary when data documentation is incomplete or missing and provide an easy way to compare data use across documents.
  • For reviewers, they can give a concise overview of data use to support the assessment of the publication under review.
  • For RAT-RS developers, they can reveal potential misinterpretations of questions that would require refinement, as well as misinterpretations of terminology that would require clearer definitions within the prompt. They can also help to identify redundant questions and duplicated reporting items and allow meta-researchers to analyse patterns of missing information across multiple papers.

 
Best Practice for Using LLMs to Generate RAT-RS Reports
 

Using LLMs for RAT-RS report generation requires the same caution recommended in my earlier post on Publisher Policies for AI Use in Preparing Academic Manuscripts. As creating these reports is typically not part of the core research activity, it is legitimate and even encouraged to use LLMs for this task, provided that the LLM contribution is acknowledged and the output is validated by a human.

Key points:

  • Be cautious: Never trust an LLM's output unconditionally; it should be treated as a tool for ideation, drafting, and refining language. It is not autonomously fit for generating a final, publishable document without rigorous human oversight. Remember that LLMs fundamentally generate responses through probabilistic pattern prediction, a process that can produce convincing but inaccurate or fabricated information. Even advanced modes using reinforcement learning improve alignment and style, but they do not instil a reliable model of factual truth. There must be a human in the loop to verify facts, logic, and context. It is ultimately your responsibility to ensure the document is scientifically sound and ethical.

  • Be transparent: Provide a declaration detailing the division of labour between the LLM and the human to enable auditability, e.g. "Data extraction by LLM (specify model, version, and interface/API); validation by human via cross-check (stating scope: sampled or entire output)". You might also want to provide a note on the prompt strategy used (e.g. referencing the RAT-RS framework, including the flavour used). This transparency increases trust and allows others to evaluate the validity and reliability of the work.

Conclusion
 
Overall, the results of this experiment are very promising. LLMs show significant potential for generating draft RAT-RS reports, and there are numerous other use cases, as outlined earlier, that could be explored. I hope that integrating LLMs will make the RAT-RS more accessible and practical, helping to address the longstanding problem of insufficient documentation of data use in ABM publications, the very issue that motivated the development of RAT-RS.
 
References
 
  • Achter, S., Borit, M., Chattoe-Brown, E., & Siebers, P. O. (2022). RAT-RS: A reporting standard for improving the documentation of data use in agent-based modelling. International Journal of Social Research Methodology. https://doi.org/10.1080/13645579.2022.2049511

  • Chattoe-Brown, E. (2014). Using agent-based modelling to integrate data on attitude change. Sociological Research Online, 19(1), Article 16. https://doi.org/10.5153/sro.3315

  • Siebers, P. O., & Aickelin, U. (2011). A first approach on modelling staff proactiveness in retail simulation models. Journal of Artificial Societies and Social Simulation, 14(2), 2. https://www.jasss.org/14/2/2.html
 
 
Back to Top
 
 
The Generative ABM Experiment (1/3): From Concept to First Prototype
 
Credits: Concept, draft text, and cross validation by Peer-Olaf Siebers. Advice and copy-editing by Claude Sonnet 4.5
 
Welcome!
 
How do you actually integrate LLMs to make agents in social simulations mimic human-like decision-making behaviour? That is the practical challenge we are tackling in this new blog series.

Welcome to the Generative Agent-Based Modelling (GABM) experiment!

In the next three blog posts I will discuss my attempt to implement my own GABM from scratch, based on the concepts described in my previous post. I will explain how I approached the tasks, what I learned from it, and what needs to change in future iterations.

The planned blog posts are:
  1. From Concept to First Prototype
  2. Towards the Perfect Prompt
  3. Stress Testing the System
Purpose
 
This first post focuses on the foundations. Here, I will describe the development of a first functional prototype. By prototype I mean an abstract and minimal version of the intended application. It must run and produce some kind of "meaningful" output. It is an exploratory tool that allows me to understand how things work and what I must pay attention to in the next development stages.
 
VIBE Coding
 
To build this quickly, I used an approach called VIBE coding. Instead of writing every line of code by hand, I described what I wanted in plain English to the LLMs I used for the conversation: Claude Sonnet 4.5 and DeepSeek-V3. These LLMs then generated the required Python code on my behalf. This is a fantastic approach to rapid prototyping: you focus on the "what" and the "why", and the AI helps with the "how". The catch? You are likely to get bloated code and may not fully understand every line that is generated, but for a first prototype, the priority is to get something that works.
 
Using LLMs Locally
 
For LLM-driven agent communication I deployed a local LLM using KoboldCPP 1.98.1. KoboldCPP is an easy-to-use, free and open-source text generation tool that allows users to run LLMs locally. Running a model locally keeps all conversations private, since nothing leaves the machine. It also allows for unlimited API usage free of charge. Execution is straightforward. After downloading the KoboldCPP executable (scroll down to "Assets" on the release page) it can be launched from the command line using its default settings. The only required input is the model file. Smaller models run faster, but this may reduce output quality. I say "may" because I have not tested the effect yet.

My "lab" setup was pretty modest: a desktop from 2018 with an Intel i5 processor, 2 GB of VRAM and 8 GB of RAM, running Windows 10 (x64). After trying out a few different models and different model sizes I decided to go for "qwen2.5-1.5b-instruct-q6_k.gguf" (1.5 GB) for my initial exploratory experiments. The model is available on Hugging Face. Which model works best for you depends, of course, on your computer specs. To run KoboldCPP in the terminal window we can use the command "koboldcpp --model qwen2.5-1.5b-instruct-q6_k.gguf --port 5001", assuming that the model is stored in the same folder as KoboldCPP.
 
Bringing Agents to Life Through LLM Dialogue
 
I started with a very simple Python script in which a few agents, each with their own personality, could send questions to the local LLM and receive personalised responses. With KoboldCPP running in one terminal window and my Python script in another, the stage was set.
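Here is a minimal sketch of one such agent-LLM exchange, assuming KoboldCPP is serving its standard /api/v1/generate endpoint on port 5001; ask_agent is a hypothetical helper written for this illustration (not the prototype's actual code), and the persona text, parsing rule, and fallback mirror the approach described later in this post:

import requests

API_URL = "http://localhost:5001/api/v1/generate"  # local KoboldCPP endpoint

def ask_agent(persona, nearby_total, nearby_infected):
    # Build a personalised prompt from the agent's persona and local statistics
    prompt = (
        f"{persona}\n\n"
        f"Nearby: {nearby_total} people, {nearby_infected} infected.\n"
        "Should you isolate? Reply: YES/NO then reason briefly.\nAnswer:"
    )
    payload = {"prompt": prompt, "max_length": 60, "temperature": 0.7}
    reply = requests.post(API_URL, json=payload, timeout=120).json()
    text = reply["results"][0]["text"]
    # Simple keyword parsing with a rule-based fallback for unclear replies
    head = text.strip().upper()[:50]
    if "YES" in head:
        return True, text
    if "NO" in head:
        return False, text
    return nearby_infected > 2, text

# Example: a cautious healthcare-worker persona with 2 of 7 neighbours infected
# will_isolate, reasoning = ask_agent("Cautious healthcare worker, isolates early.", 7, 2)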
 
llm-sir-12-first-attempt
Click image to enlarge
 
Communication worked! This marked the first real milestone :).
 
The Experiment: An AI-Powered Epidemic Model
 
With basic communication working, it was time for a real test. I chose a classic Agent-Based Model, the SIR model, which simulates how a disease like COVID-19 spreads through a population, with people transitioning between "susceptible", "infected", and "recovered" states.

The twist? Instead of agents following pre-defined rules for decision-making, I tasked the LLM with making decisions on their behalf. The scenario: "Should a healthy agent decide to self-isolate based on its personality and the number of infected people nearby?"

Here is the pseudocode of the LLM-driven decision process:
FOR EACH simulation day:
    // Step 1: Filter only susceptible agents
    susceptible_agents = agents WHERE state == 'S'
    FOR EACH agent IN susceptible_agents:
        // Step 2: Gather context about agent's neighbourhood
        nearby_agents = find_agents_within(observation_radius)
        stats = {
            total: count(nearby_agents),
            infected: count(nearby_agents WHERE state == 'I'),
            susceptible: count(nearby_agents WHERE state == 'S'),
            recovered: count(nearby_agents WHERE state == 'R')
        }
        // Step 3: Create personalised prompt
        prompt = """
            {agent.persona}
            
            Nearby: {stats.total} people, {stats.infected} infected.
            Should you isolate? Reply: YES/NO then reason briefly.
            Answer:
        """
        // Step 4: Send prompt to LLM (via KoboldCPP API)
        response = await query_llm(prompt)
        // Step 5: Parse LLM response
        IF response contains "YES" in first 50 characters:
            agent.will_isolate = True
        ELSE IF response contains "NO" in first 50 characters:
            agent.will_isolate = False
        ELSE IF response is error OR unclear:
            // Fallback rule: isolate if many infected nearby
            agent.will_isolate = (stats.infected > 2)
        // Step 6: Log decision for analysis
        log(day, agent, stats, decision, reasoning)
    // Step 7: Apply isolation behaviour
    FOR EACH agent:
        IF agent.will_isolate:
            // Isolated agents move slower
            agent.move(isolated_speed)  // 20% of normal speed
        ELSE:
            agent.move(normal_speed)
    // Step 8: Disease spread (respects isolation decisions)
    FOR EACH infected_agent WHERE state == 'I' AND NOT will_isolate:
        nearby = find_agents_within(infection_radius, infected_agent)
        FOR EACH other IN nearby:
            IF other.state == 'S' AND NOT other.will_isolate:
                IF random() < infection_probability:
                    other.infect()  // Becomes infected
Example interaction:
  • Prompt sent to LLM: "Cautious healthcare worker, isolates early. Nearby: 7 people, 2 infected. Should you isolate? Reply: YES/NO then reason briefly."

  • LLM responds: "YES. As a healthcare worker, I understand transmission risks. Two infections nearby is significant, and early isolation prevents spread."

  • Result: Agent isolates and moves at isolated_speed (20% of normal speed). While isolated, they are skipped during disease spread calculations, so they cannot catch the disease in that particular step.

  • Each day, susceptible agents re-evaluate their decision by querying the LLM again with updated neighbourhood statistics.
After many attempts, the simulation finally worked! Agents made LLM-driven decisions, and it was possible to visually observe the system state being updated for each simulated day. For 50 agents, 10 simulated days took about 20 minutes on my machine. The simulation would run until either all infections had resolved, or the maximum limit of 25 days was reached, whichever came first.
 
llm-sir-12-visualisation
Click image to enlarge
 
llm-sir-12-stats
Click image to enlarge
 
The "It Looks Right, But Is It?" Problem
 
Looking at it from a macro-level perspective, everything looked fine. However, looking carefully at the micro-level via the communication log, it turned out that many of the log entries contained some contradictory information about persona, decision, and reasoning. This brings us back to a typical LLM phenomenon: something looks right on the surface as it is presented in fluent natural language, but when we dig deeper we realise that it is not. The log also showed that quite a few decisions were based on the fallback rules used when communication and transmission errors occurred.

Here is an example where the response contains fundamental contradictions upon closer inspection.
  • Details submitted: Social person, finds isolation difficult. Nearby: 4 people, 0 infected.
  • Response: Isolation: NO. Reasoning: The individual is a nearsighted person who struggles with social interactions. Given the low infection rate of 0%, it's unlikely that they would be exposed to an infectious person nearby. Therefore, isolating them could potentially harm their mental health and social connections.
Often our expectation is that the LLM would deliver something meaningful when responding to clear instructions provided in our prompts. But, as we can see, this is not always the case. In the next blog post we will discuss how we can design more robust prompts and response analysis methods to improve the coherence of multi-component responses.

Avoiding transmission errors requires fine-tuning (calibration) of the parameters controlling the LLM operations, such as prompt and response length and the choice of LLM, as well as ensuring that we have sufficient computing power and memory.
 
Conclusions
 
This first prototype achieved its main goal: we have a collection of agents independently communicating with an LLM for decision support and we have a working simulation model. We cannot expect a fully robust decision-support tool after just two weeks of evening work! What is good about this prototype is that it does not crash (it has a sophisticated error handling framework) and communication runs smoothly, as long as the number of concurrent requests is kept at an appropriate level (three on my machine). The prototype has proven to be very helpful for exploring how to build such GABMs in principle and where the pitfalls are. The logging system and the stats displayed at the end of a simulation run are very helpful when validating the model at both micro and macro levels.

The following sequence diagram shows how the system currently operates:
 
llm-sir-12-seq-current
Click image to enlarge
 
The main shortcomings of the current prototype are:
  • Agents have no memory: They do not learn from past decisions or infection trends
  • Prompts are too generic: They lack simulation context and temporal awareness
  • Decision parsing is weak: Simple keyword matching overrides persona logic
  • No decision validation: Persona-inconsistent decisions are not corrected
In order to overcome these shortcomings, here are some suggestions for improvements:
  • Context-rich prompts: Include trends, history, and simulation state
  • Persona validation: Enforce persona-consistent behaviour
  • Memory system: Learn from past outcomes and patterns
  • Reasoning quality scoring: Measure and improve decision quality
  • Feedback loops: Use outcomes to refine future prompts
Of course, such improvements would require more computing power and memory than the current solution. The following sequence diagram shows how such a system might operate:
 
llm-sir-12-seq-ideal
Click image to enlarge
 
In the next post we will see how to develop context-rich prompts and what the impact is in terms of decision quality. It is not as straightforward as you might think!

The VIBE-coded SIR prototype ABM Python source code is available on GitHub.
 
Last but Not Least, a Warning
 
We must maintain a healthy scepticism toward LLMs. Despite their capabilities, they are prone to high hallucination rates and are often disconcertingly good at presenting these fabrications with confidence. While solutions like Retrieval-Augmented Generation (RAG) show promise, the core issue remains: LLMs operate on probability, not true understanding.

Additionally, an often-overlooked aspect is that an LLM's responses are contextual to the entire conversation. If you are debugging code, a crucial best practice is to start a new chat once you have a fixed version. This ensures the model is no longer influenced by the "memory" of your earlier, buggy code, leading to cleaner and more accurate responses.

This inherent unpredictability makes the use of good operational practices as well as rigorous verification and validation more important than ever. Do not assume an LLM will correctly handle even simple tasks. Always double-check the output.
 
 
Back to Top
 
 
From Rules to Reasoning: Engineering LLM-Powered Agent-Based Models
 
Credits: Concept, draft text, and cross validation by Peer-Olaf Siebers. Advice and copy-editing by Claude Sonnet 4.5
 
Welcome!
 
Agent-based simulations have traditionally relied on explicit rule-based logic: agents follow predetermined if-then statements to make decisions. But what happens when we replace these rigid rules with large language models (LLMs) that can reason in natural language about complex scenarios?

This shift introduces powerful new capabilities: agents can handle nuanced situations, demonstrate emergent reasoning, and respond to contexts that weren't explicitly programmed. However, it also introduces new technical challenges. Unlike instant rule evaluation, LLM calls require network requests to external servers, taking hundreds of milliseconds per decision. When you have hundreds or thousands of agents, this creates bottlenecks that traditional ABM frameworks weren't designed to handle.

This post explores three fundamental concepts for building LLM-powered agent simulations: concurrent processing, decision independence, and context management. Throughout, we'll use Python code snippets to illustrate these concepts in practice, showing how to avoid common pitfalls and design systems that are both realistic and scalable.
 
Use Case: Simulating Disease Spread
 
Imagine simulating a disease outbreak using an SIR (Susceptible-Infected-Recovered) model in a town with 500 residents. Each person needs to decide their daily actions—whether to go to work, stay home, or seek medical care—based on local infection rates, their health status, and personal circumstances.
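As a point of reference, the per-agent state needed for this scenario is small. The following dataclass is purely illustrative (the field names are my own, not a prescribed schema):
from dataclasses import dataclass, field

@dataclass
class Resident:
    """Minimal agent state for the SIR scenario (illustrative field names)."""
    id: int
    age: int
    sir_status: str = "S"                          # "S", "I" or "R"
    symptom_severity: int = 0                      # 0 (no symptoms) to 10 (severe)
    location: str = "town_centre"
    history: list = field(default_factory=list)    # optional decision memory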
 
 
The Temporal Ordering Problem
 
In real life, people make decisions simultaneously based on the same shared reality. To replicate this in our simulation, we must ensure decision independence—all agents observe the same world state when making their choices.

If we were to take a purely sequential approach—where each individual makes a decision and immediately updates the world state—we would create an artificial causality chain. The first agent would act based on yesterday's world state, but by the time we reach agent 500, the simulation would have already applied 499 decisions. This introduces temporal ordering bias.

In an SIR model, this means that if agent 1 becomes infected and its state is immediately updated, agent 2 now perceives a higher infection rate than agent 1 did, possibly leading it to take more cautious actions such as "stay_home". This artificial dependency on agent ordering distorts the simulation's realism and can lead to systematically biased outcomes that don't reflect how decisions would actually unfold.
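To make the problem concrete, here is a sketch of the purely sequential approach described above (the update_world_state() helper is hypothetical). Each agent observes a world that has already been modified by every agent processed before it:
def simulate_day_sequential_biased(self):
    """Anti-pattern: decisions and state updates are interleaved."""
    for agent in self.agents:
        decision = agent.make_decision(self.world_state)
        agent.apply_decision(decision)    # the world state mutates immediately,
        self.update_world_state()         # so the next agent sees a different world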
 
The Two-Phase Mechanism Solution
 
To remove this ordering bias, we use a two-phase update mechanism that separates decision-making from decision-application. All agents first decide what to do based on the same snapshot of the world state (Phase 1), and only after every decision is collected do we apply all of them simultaneously (Phase 2). This ensures that all decisions are based on a consistent world view, eliminating the artefact of sequential bias.

In summary, the two-phase update cycle works as follows:
  • Phase 1 (Decide): Every agent observes the same snapshot of the world state and chooses an action.
  • Phase 2 (Apply): Once all decisions have been collected, they are applied to the world state simultaneously.
 
Implementing Concurrency for Rule-Based Agents
 
For a traditional rule-based Agent-Based Model (ABM), a synchronous implementation of this two-phase approach looks like this:
def simulate_day_synchronous(self):
    # Phase 1: All agents decide based on the same world state
    decisions = [agent.make_decision(self.world_state) for agent in self.agents]

    # Phase 2: Apply all decisions simultaneously
    for agent, decision in zip(self.agents, decisions):
        agent.apply_decision(decision)
In this design, every agent observes the same snapshot of infection rates, hospital capacity, and other environmental variables before any changes are applied. Only after every agent has made its decision do we update the system state.

Note that whilst this implements the correct logical structure for concurrent decisions, the execution itself is sequential—we process one agent at a time. For fast, rule-based logic where decision-making happens in microseconds, this is perfectly adequate. However, once decision-making involves external calls—such as requests to an LLM API—we need true concurrent execution to avoid unacceptable wait times.
 
Implementing Concurrency for LLM-Driven Agents
 
The same two-phase structure applies to LLM-driven agents, but the technical implementation must change. The challenge is no longer just logical correctness but managing I/O latency from potentially hundreds of concurrent API calls.

Here we use asynchronous programming—a form of concurrency where a single thread can pause (or yield control) while waiting for external operations—such as network requests—to complete. This allows other tasks to continue during that waiting period. In Python, this is implemented with the "asyncio" framework.

The following implementation uses "await asyncio.gather()" to run all decision tasks with true concurrent execution:
async def simulate_day_concurrent(self):
    # Phase 1: All agents make decisions concurrently
    decision_tasks = [agent.make_decision() for agent in self.agents]
    decisions = await asyncio.gather(*decision_tasks)

    # Phase 2: Apply all decisions after reasoning completes
    for agent, decision in zip(self.agents, decisions):
        agent.apply_decision(decision)
Here, "await" suspends execution of the coroutine until the awaited task completes, allowing other tasks to run in the meantime. "asyncio.gather()" runs all tasks concurrently and returns their results once all are complete. This structure ensures that all decisions are made "at the same simulated moment" whilst also fully utilising available I/O time to maximise throughput.
 
Why Not Use Threads?
 
It's important to distinguish between asynchronous I/O concurrency and multithreading. Threading runs tasks in parallel using multiple threads within a process, but in Python this introduces complexity and overhead. Since LLM API calls are I/O-bound rather than CPU-bound, multithreading provides little benefit and may even reduce performance due to Python's Global Interpreter Lock (GIL), which prevents true parallel execution of Python code.

Asynchronous I/O, in contrast, is lightweight, avoids GIL contention, and scales efficiently across large numbers of agents. The program doesn't waste time waiting—when one request is pending, it immediately starts or continues processing other requests.

Only CPU-intensive tasks, such as complex numerical computation or local model inference, justify using threads or multiprocessing. For LLM-driven agents that primarily issue network requests, asynchronous I/O concurrency is the optimal strategy.
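As a rough sketch, assuming a hypothetical local_model with a blocking predict() method (local_model, predict() and get_features() are all placeholder names), the two cases can be kept apart like this: the CPU-bound call is offloaded to a worker thread, whilst the I/O-bound API call is simply awaited:
import asyncio

async def run_local_inference(self, agent):
    """CPU-bound work: offload to a worker thread so the event loop stays responsive."""
    # asyncio.to_thread (Python 3.9+) runs the blocking call in a separate thread
    return await asyncio.to_thread(self.local_model.predict, agent.get_features())

async def fetch_llm_decision(self, agent):
    """I/O-bound work: a plain await is sufficient, no threads required."""
    return await self.llm_client.generate(self.build_agent_prompt(agent))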
 
Managing Server Load with Semaphores
 
Firing 500 simultaneous requests at your LLM server will likely overwhelm it, returning HTTP 503 errors (indicating the server is temporarily overloaded). A semaphore limits concurrent requests:
class Simulation:
    def __init__(self, max_concurrent_requests=50):
        self.semaphore = asyncio.Semaphore(max_concurrent_requests)
    
    async def make_decision_with_limit(self, agent):
        async with self.semaphore:
            return await agent.make_decision()
    
    async def simulate_day(self):
        tasks = [self.make_decision_with_limit(agent) 
                 for agent in self.agents]
        decisions = await asyncio.gather(*tasks)
Now only 50 requests run simultaneously, whilst the others queue. This balances speed with reliability. Adjust the limit based on your server capacity and rate limits.

If you're experiencing constant 503 errors, you can use this concurrent architecture but set "max_concurrent_requests=1". This is not the same as sequential execution. It still maintains the crucial separation between decision and application phases, eliminating temporal ordering bias, whilst preventing server overload. You can then gradually increase concurrency as your infrastructure allows.
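If you want to find a sustainable limit empirically, a simple probing routine can help. The sketch below assumes a hypothetical error_count attribute that is incremented by the error handling described in the next section:
async def find_safe_concurrency(self, candidates=(1, 5, 10, 25, 50)):
    """Probe increasing concurrency levels and keep the last level that ran cleanly."""
    safe_level = 1
    for level in candidates:
        self.semaphore = asyncio.Semaphore(level)
        errors_before = self.error_count
        await self.simulate_day()
        if self.error_count > errors_before:
            break                 # errors appeared at this level, so stop probing
        safe_level = level
    return safe_level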
 
Robust Error Handling
 
Network requests fail. APIs have outages. Timeouts occur. Reliable LLM-driven simulations require robust error handling with retry logic:
async def make_decision_with_retry(self, agent, max_retries=3):
    """Make a decision with exponential backoff retry logic."""
    for attempt in range(max_retries):
        try:
            async with self.semaphore:
                response = await agent.make_decision()
                # Validate response format
                if not self.is_valid_decision(response):
                    raise ValueError(f"Invalid decision format: {response}")
                return response
                
        except (asyncio.TimeoutError, aiohttp.ClientError) as e:
            if attempt == max_retries - 1:
                # Final attempt failed - use fallback
                self.log_error(f"Agent {agent.id} defaulting after {max_retries} attempts")
                return agent.get_default_decision()
            
            # Exponential backoff: wait 1s, 2s, 4s...
            wait_time = 2 ** attempt
            await asyncio.sleep(wait_time)
            
        except Exception as e:
            self.log_error(f"Unexpected error for agent {agent.id}: {e}")
            return agent.get_default_decision()
This pattern ensures your simulation continues even when individual LLM calls fail, whilst attempting to recover from transient network issues through exponential backoff.
 
Understanding Context Windows
 
The context window represents an LLM's working memory, measured in tokens. For English text, one token typically represents between 0.5 and 1.3 words, depending on the tokeniser and text complexity (technical terms often require more tokens than common words). Typical context capacities range from 4,096 to 128,000 tokens.

This limit encompasses both the input prompt and the generated response. Crucially, the context window operates on a per-request basis rather than per agent. Each time an agent calls the LLM (for example, using "await self.llm_client.generate(prompt)"), the model processes that specific request independently, produces a response, and then discards all information from that interaction. There is no persistent memory between calls—each request begins with a completely blank state.
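Because of this, it is worth logging approximate prompt sizes during development. The following heuristic sketch assumes roughly 0.75 words per token, which sits within the range quoted above; exact counts depend on the tokeniser:
def estimate_tokens(self, text):
    """Rough token estimate assuming ~0.75 words per token for English text."""
    return int(len(text.split()) / 0.75)

def check_prompt_size(self, prompt, context_limit=8192, response_budget=512):
    """Warn when a prompt risks leaving too little room for the response."""
    estimate = self.estimate_tokens(prompt)
    if estimate + response_budget > context_limit:
        self.log_warning(f"Prompt (~{estimate} tokens) may not fit the context window")
    return estimate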
 
Designing Self-Contained Prompts
 
Since each API call is independent, your prompts must be self-contained. Every decision requires a complete description of the agent's situation:
def build_agent_prompt(self, agent):
    return f"""You are {agent.name}, age {agent.age}.
    
Current Health Status: {agent.health_status} ({agent.sir_status})
Local Infection Rate: {self.get_local_infection_rate(agent.location)}%
Nearby Hospital Capacity: {self.get_hospital_capacity(agent.location)}
Your Symptom Severity: {agent.symptom_severity}/10
Available Actions: stay_home, go_to_work, seek_medical_care, visit_shop
Current Date: Day {self.current_day} of epidemic

Based on this information, what will you do today?
Respond with a single action and brief reasoning.
Format: ACTION: <action> | REASONING: <reasoning>"""
For truly independent decisions, include only current state—no conversation history. This keeps context requirements minimal and ensures agents don't influence each other's reasoning. In our SIR model, each agent sees only their own status and the current infection rates, not the decisions other agents are making.

Note the explicit format instruction at the end—this makes parsing the LLM's response more reliable and handles one of the practical challenges of LLM integration.
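For completeness, here is a sketch of how such a response could be parsed defensively. It mirrors the ACTION/REASONING format above and signals invalid actions so the caller can fall back to a default decision:
def parse_decision(self, response, valid_actions=("stay_home", "go_to_work",
                                                  "seek_medical_care", "visit_shop")):
    """Parse 'ACTION: <action> | REASONING: <reasoning>' responses defensively."""
    action, reasoning = None, ""
    for part in response.split("|"):
        key, _, value = part.partition(":")
        if key.strip().upper() == "ACTION":
            action = value.strip().lower()
        elif key.strip().upper() == "REASONING":
            reasoning = value.strip()
    if action not in valid_actions:
        return None, reasoning    # caller should substitute a default decision
    return action, reasoning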
 
When Context Windows Become Critical
 
The design choice between stateless and stateful agents has significant implications for context management:
  • Stateless agents (recommended for most simulations) make independent decisions, so each decision requires only a single prompt within the context window. This keeps them simple and scalable.
  • Stateful agents (used in complex social simulations) accumulate memory across timesteps, so both the current prompt and the conversation history must fit within the context window. This introduces the risk of forgetting once the token limit is exceeded and necessitates careful memory management strategies.

Such memory management strategies include truncating recent interactions, summarising older history, retaining emotionally or causally significant events, or using vector databases for semantic memory retrieval.

For stateful agents, a memory management strategy could look like this:
async def compress_agent_memory(self, agent):
    """Compress old memories when approaching context limits."""
    if len(agent.history) > 10:
        # Keep recent interactions, summarise older ones
        recent_events = agent.history[-5:]
        old_events = agent.history[:-5]
        
        summary_prompt = f"""Summarise these past events in 2-3 sentences:
        {old_events}"""
        
        summary = await self.llm_client.generate(summary_prompt)
        agent.history = [f"[SUMMARY] {summary}"] + recent_events
 
Controlling Randomness in LLM Responses
 
Unlike rule-based agents that produce identical outputs for identical inputs when using the same random seed, LLMs introduce stochasticity. The same prompt can yield different responses due to the model's temperature parameter, which controls randomness.

For reproducible simulations:
async def make_deterministic_decision(self, agent):
    """Generate deterministic decisions for reproducible simulations."""
    response = await self.llm_client.generate(
        prompt=self.build_agent_prompt(agent),
        temperature=0.0,  # Deterministic mode
        seed=42  # Some APIs support explicit seeding
    )
    return response
For exploratory simulations where you want to capture the range of possible agent behaviours, use moderate temperature values (0.3-0.7) and run multiple simulation replications with different random seeds.
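A minimal sketch of such a replication loop, assuming hypothetical reset() and llm_temperature hooks on the simulation class, could look like this:
async def run_replications(self, n_replications=10, temperature=0.5):
    """Run several replications to capture behavioural variability."""
    outcomes = []
    for replication in range(n_replications):
        self.reset(seed=replication)          # re-initialise agents and world state
        self.llm_temperature = temperature    # passed on to each generate() call
        outcomes.append(await self.run_simulation())
    return outcomes                           # analyse the spread across replications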
 
Validation and Calibration
 
How do you verify that LLM agent behaviour aligns with domain expectations? Unlike rule-based models where logic is transparent, LLM reasoning is opaque. Implement these validation practices:
  1. Baseline Comparison: Compare LLM agent behaviour against rule-based equivalents
     def validate_against_baseline(self):
         """Compare LLM agents to rule-based agents on known scenarios."""
         scenarios = self.generate_test_scenarios()

         for scenario in scenarios:
             llm_decisions = self.run_llm_agents(scenario)
             rule_decisions = self.run_rule_agents(scenario)

             agreement_rate = self.calculate_agreement(llm_decisions, rule_decisions)

             if agreement_rate < 0.7:
                 self.log_warning(f"Low agreement on scenario: {scenario}")

  2. Expert Review: Log agent reasoning and have domain experts review samples
     def log_decision_with_reasoning(self, agent, decision, reasoning):
         """Log decisions for qualitative review."""
         self.decision_log.append({
             'timestamp': self.current_day,
             'agent_id': agent.id,
             'agent_state': agent.get_state_dict(),
             'decision': decision,
             'reasoning': reasoning,
             'world_state': self.world_state.copy()
         })

  3. Sensitivity Analysis: Test how prompt variations affect aggregate outcomes
     async def sensitivity_analysis(self):
         """Test robustness to prompt formulation."""
         prompt_variants = {
             'conservative': self.build_prompt_conservative,   # map variant names to builder callables
             'neutral': self.build_prompt_neutral,
             'detailed': self.build_prompt_detailed
         }

         results = {}
         for variant_name, prompt_builder in prompt_variants.items():
             self.set_prompt_builder(prompt_builder)
             outcome = await self.run_simulation()
             results[variant_name] = outcome

         self.analyse_variance(results)
If small prompt changes yield drastically different macro-level outcomes, your simulation lacks robustness and requires prompt refinement.
 
Practical Guidelines
 
The following guidelines summarise key best practices for developing robust and scalable LLM-driven agent-based models:
  1. Start with stateless agents: Only add memory if your research question requires it
  2. Design self-contained prompts: Include all necessary context in each request with explicit format instructions
  3. Use concurrent processing: Implement "asyncio.gather()" for realistic, unbiased decisions and dramatic performance gains
  4. Limit concurrent requests: Use semaphores to prevent server overload and respect rate limits
  5. Implement robust error handling: Use retry logic with exponential backoff and fallback decisions
  6. Monitor token usage: Log prompt sizes to ensure you stay within limits
  7. Control for non-determinism: Set temperature=0.0 for reproducible simulations
  8. Validate systematically: Compare against baselines, conduct expert reviews, and perform sensitivity analyses
  9. Test at scale gradually: Start with 10 agents, then 50, then 500—don't debug concurrency issues with thousands of agents
 
Conclusion
 
Building LLM-powered agent simulations requires careful consideration of concurrency, context, and control flow. By processing decisions concurrently with proper rate limiting, designing self-contained prompts with robust error handling, and systematically validating agent behaviour, you can create realistic, scalable simulations that avoid common pitfalls.

The transition from rule-based to LLM-driven agents isn't just a technical upgrade—it's a paradigm shift that enables simulations of unprecedented behavioural complexity. However, this power comes with responsibilities: managing costs, ensuring reproducibility, and validating that emergent behaviours reflect genuine insights rather than prompt engineering artefacts.

The future of agent-based simulation is "conversational" - let's build it together.
 
 
Back to Top
 
 
Publication: A Novel Multi-Agent Reinforcement Learning System for Trading Strategies
 
Credits: Written by Peer-Olaf Siebers
 
The paper titled "StockMarl: A Novel Multi-Agent Reinforcement Learning System To Dynamically Improve Trading Strategies", authored by Peiyan Zou and Peer-Olaf Siebers, has been presented at the 37th European Modeling & Simulation Symposium (EMSS 2025), which is part of the I3M Conference.

The paper deals with the development of StockMARL, an innovative simulation platform that integrates multi-agent modelling with deep reinforcement learning to create adaptive trading strategies. The system enables learning agents to observe and interact with diverse rule-based traders, allowing them to develop resilient and interpretable strategies within dynamic, behaviourally rich market environments.
 
stockmarl-cp
Click image to enlarge
 
 
The paper is based on Peiyan's BSc dissertation and is available here. The presentation given at the conference is available on YouTube. The slides are available here.
 
 
Back to Top
 
 
PROJECT UPDATE: Streamlining Simulation Modelling with Generative AI
 
Credits: Drafted by Peer-Olaf Siebers; turbocharged and summarised by ChatGPT-5.
 
Simulation modelling is a powerful tool for exploring complex systems, particularly in Operations Research and Social Simulation. Agent-based modelling allows researchers to capture human decision-making and social dynamics, but its reliance on extensive manual coding creates a significant bottleneck.

A University of Nottingham summer internship project, led by Sener Topaloglu and supervised by Peer-Olaf Siebers, investigated how Generative AI can help overcome this barrier. By leveraging Large Language Models (LLMs), the project explored automating the translation of natural language descriptions into GAML scripts for the Gama simulation platform. The approach used prompt engineering and reusable design patterns, aligned with the Engineering Agent-Based Social Simulation (EABSS) framework, to streamline the scripting process and enable model reusability.

The feasibility study demonstrated that open-source models like Mistral and Llama can generate useful code. Smaller-scale fine-tuning proved effective, though larger datasets introduced hallucinations. The research also highlighted challenges, including resource limitations, context loss during crashes, and syntactic or logical errors in generated scripts. Despite these hurdles, the project showed that Generative AI can significantly reduce coding effort in simulation modelling.

Since the internship concluded in August 2024, further progress has been made on this project. Sener has continued the research in his spare time, concentrating on improving model reliability, experimenting with the latest LLMs, streamlining the EABSS script, and building a fully automated pipeline that connects conceptual design and implementation.

Detailed reports documenting the project and its extensions are available here:
  • Main Internship Report (09/2024): This report describes the following: Setting up an Ollama environment with a T4 GPU for LLM execution, implementing and testing GAML benchmark models (Game of Life, Predator-Prey, Schelling, Bass Diffusion, Adaptive Museum), and refining prompts for UML and GAML code generation. It automated EABSS workflows through Python scripting, tested multiple LLMs (Mistral, Llama 3.1, Gemma), and compared outputs. Additional work documented model setup, quantisation methods, syntax correction strategies, and improvements for multi-step model-to-code pipelines.

  • Updates and Extensions 01 Report (12/2024): This report describes the following: Implementing GAML test models (Schelling, Bass diffusion, adaptive museum, Conway’s Life), developing a Python automation script for Ollama-based EABSS execution, and refining prompts for UML and GAML generation. It configured Mistral NeMo with optimised parameters and KV quantisation, iteratively improved multi-step GAML scaffolding, and benchmarked across scenarios. Additional work compared LLMs (NeMo, Llama3.1, Mistral-Small, StarCoder), analysed syntax errors, and explored fine-tuning requirements with detailed setup and configuration guides.

  • Updates and Extension 02 Report (07/2025): This report describes the following: Refining Python automation scripts for EABSS with improved verbosity, context length handling, and API integration; creating new automation bots for OpenAI and Gemini models; enhancing test scenario prompts with consistent roles; restructuring EABSS scripts for small/medium versus advanced models; tuning Gemma 3 12B hyperparameters; systematically comparing Gemma 3 with Mistral NeMo; correcting and testing generated GAML scripts; and documenting fine-tuning protocols and Open WebUI usage.
Full information and relevant resources can be found on GitHub.
 
Back to Top
 
 
LLM4ABM Discussion @ The Ethics of LLM-Augmented ABM
 
Credits: Content co-created by the LLM4ABM SIG members. Outlined by Peer-Olaf Siebers. Copy-edited by Claude Sonnet 4.
 
Welcome!
The integration of Large Language Models (LLMs) into Agent-Based Modelling (ABM) is moving faster than our ability to fully grasp its ethical consequences. Researchers are already experimenting with LLM-augmented workflows, yet the community lacks a shared framework for thinking about the risks and responsibilities involved. This makes it urgent to pause, reflect, and start shaping collective guidelines before questionable practices become entrenched.

What follows is a glimpse into a lively, and at times chaotic, discussion from a recent LLM4ABM SIG meeting. The conversation moved in many directions, but with Claude's help it has been distilled into clear themes that reveal both the promise and the risks of LLM-augmented ABM.
 
Image by Copilot (08/2025)
 
Why Do We Need an LLM4ABM Ethics Framework?
ABM has always involved ethical considerations, from how we represent human behaviour to whose voices we include in our models. But the integration of LLMs into ABM research introduces what one ethicist calls "the seduction of the frictionless". This seduction is dangerous. When we use LLMs to simulate stakeholder perspectives, we eliminate the messy, uncomfortable negotiations that define real human relationships. Unlike actual humans who push back, argue, and disagree, LLMs always comply. They will happily play any stakeholder role we assign them, creating an illusion of participatory modelling while actually silencing the very voices we claim to represent. This frictionless interaction risks making us forget that "the worth of a relationship is in the friction": the challenging process of negotiating different viewpoints to reach genuine consensus.
 
What Do We Mean by Ethics in LLM4ABM?
Ethics in this context operates on two levels. At its core, it is about choice and intentionality: we can only act ethically when we have alternatives and make deliberate decisions about our actions. In ABM research, this translates to ensuring our models do not systematically disadvantage or exclude people, particularly marginalised communities.

The framework emerging from recent discussions identifies ethics as both deontological (rule-based obligations like "everyone should be heard") and consequentialist (focusing on long-term impacts rather than short-term gains). Crucially, ethics becomes meaningful only when there are "others" whose rights and perspectives we must protect—whether they are research participants, affected communities, or future generations.
 
Ethical Dimensions Across the ABM Lifecycle
The ethical risks of LLM integration are not evenly distributed across the modelling process. They are concentrated in the initial stages of the ABM lifecycle:
  • Problem Definition: This sets the entire research trajectory. Using LLMs here risks systematically ignoring marginalised voices who are underrepresented in training data, particularly problematic in participatory research where affected communities should drive question formulation.

  • Conceptual Modelling: LLMs may obscure crucial assumptions and simplifications. While human modellers remain aware of what they are cutting away, LLMs present their outputs as complete without revealing eliminated alternatives, creating dangerous blind spots.

  • Data Generation: Synthetic populations created by LLMs inherit unknown biases from base models. Recent research shows these biases transmit even through fine-tuning, with no reliable way to detect or eliminate them.
Checklist of Ethical Risks
As a practical starting point for responsible use of LLM4ABM, the following checklist outlines the core ethical risks that demand our attention and deliberate action:
  • Bias amplification: Are marginalised perspectives systematically excluded?
  • Transparency loss: Can we trace and validate our modelling assumptions?
  • Privacy violations: Does commercial LLM use expose participant data?
  • False participation: Are we creating illusions of stakeholder engagement?
  • Research integrity: How do we distinguish LLM-assisted from purely synthetic work?
  • Accountability gaps: Who is responsible when LLM-generated models cause harm?
Next Steps for the ABM Community
We need ethical guidelines that acknowledge both LLMs' potential benefits (like protecting participant privacy through synthetic data) and their risks. This means developing transparent declaration standards for LLM use, creating frameworks for validating synthetic data quality, and establishing community norms around responsible AI integration. The goal is not to ban LLMs from ABM research, but to use them ethically, recognising that true innovation comes not from eliminating friction, but from navigating it responsibly.
 
Afterthought
After we finished our discussion, I continued thinking about the topic and why our conversation felt so scattered. It then dawned on me that we actually have multiple dimensions of ethics that need to be considered separately at each stage of the ABM life cycle:
  1. Epistemic Ethics concerns the pursuit of truth through transparent methodologies and the clear justification of knowledge claims.

  2. Participatory Ethics focuses on ensuring genuine inclusion of all voices and actively mitigating inherent power asymmetries within the research process.

  3. Research Ethics upholds standards of academic integrity, including the honest attribution of authorship and full disclosure of methods and influences.

  4. Consequential Ethics evaluates the broader societal impact of research, aiming to prevent harm and mitigate the risks of long-term misuse.
As matters have now become truly complex, what we need is a sophisticated ethical framework along with practical tools, such as a Checklist of Ethical Risks When Using LLM4ABM and a Reporting Standard for LLM Use in ABM. Let's roll up our sleeves and get started.
 
 
Back to Top
 
 
LLM4ABM Discussion @ LLMs and ABMs: Promise, Pitfalls, and the Path to Trust
 
Credits: Content co-created by the LLM4ABM SIG members. Written by Peer-Olaf Siebers. Copy-edited by Claude Sonnet 4.
 
Welcome!
Large Language Models (LLMs) are increasingly influencing research practices. For those working with Agent-Based Models (ABMs), the key question is how to integrate them effectively. They offer practical support in generating ideas and drafting code, but their use also brings uncertainty and caution.

This post presents perspectives on using LLMs as tools that enhance research practices without undermining scientific rigour. The ABM community is at a pivotal moment as LLMs demonstrate unprecedented capabilities in social simulation and computational modelling. Recent discussions in the LLM4ABM SIG reflect both enthusiasm and concern regarding the integration of these models into research workflows. The central question addressed in this post is: Where can these tools provide genuine value, and where might they lead us off course?
 
Exploration versus Explanation
Perhaps the clearest boundary is between exploration and explanation. In the early stages of a project, LLMs shine. They can suggest new angles, summarise background material, or help generate initial model ideas. These tasks are creative and low risk, and the speed at which LLMs work makes them useful companions. But the stakes change when it comes to explanation, when we are trying to generate evidence, justify findings, or interpret results. Here, over-reliance on LLMs is dangerous. Their outputs are persuasive but not always reliable. If we mistake fluency for accuracy, we risk misleading ourselves and others. Many in the community agree: LLMs may be powerful for idea generation, but they cannot yet be trusted as evidence-making engines.
 
Trust and Transparency
That raises the question of trust. How do we create confidence in when and how these tools are used? The answer will not come from individuals working in isolation. It will require community-wide norms and practices. Other tools went through a similar journey. Calculators and spell checkers were once controversial in academic settings. They only became unremarkable once standards were set for when and how they could be used. LLMs are on the same trajectory. Until they become routine, transparency matters. Researchers have a duty to report how they used them—whether for editing, coding, brainstorming, or interpretation. This is not about shaming anyone, but about making practices visible and building trust.
 
Standards, Ethics, and Practicality
Of course, the practicalities matter. Some suggest every paper should include a short acknowledgement describing how LLMs were used. Others even argue for sharing prompts, so results can be reproduced. But would this create needless complexity? Many believe a simpler, lighter-touch approach is more realistic. Ethics are another concern. If LLMs shape research directions, code, or even conclusions, should we treat their influence the same way we treat human collaborators? Should there be explicit ethical guidelines about where they fit in? These questions are unresolved, but they are not going away.
 
Why the Unease?
Interestingly, researchers rarely feel nervous admitting they used machine learning in their work. Yet many hesitate to disclose LLM usage. Why the double standard? Perhaps because machine learning is seen as a technical method, while LLMs feel more like intellectual partners—closer to the work of writing and thinking. Whatever the reason, silence will not build confidence. Only openness will.
 
Conclusion
For now, the safest position is pragmatic. Use LLMs freely for exploration and brainstorming but be cautious about treating them as sources of evidence. Report their role openly, even if only briefly. Push for simple, shared standards that do not overburden researchers, but still ensure integrity. In time, the unease will fade. Just as calculators and spell checkers became unremarkable, LLMs will eventually find their place in everyday research. The transition will be messy, but the path is clear: cautious experimentation, transparent reporting, and collective responsibility for shaping how these tools are used.
 
 
Back to Top
 
 
LLM4ABM: A Forum for Discussing the Role of LLMs in ABMs
 
Credits: Written by Peer-Olaf Siebers. Copy-edited by Claude Sonnet 4
 
LLM4ABM is a Social Simulation discussion group founded in 2024 that meets online once a month. We are united by a shared interest in exploring how Large Language Models (LLMs) and Agent-Based Modelling (ABM) interact across the entire simulation study life cycle.

Our starting point was a focused discussion on how LLMs might help transform qualitative evidence (interview data, ethnographic insights, case studies, expert knowledge, etc.) into behavioural rules that can be utilised in agent-based models.

In practice, our conversations have expanded well beyond this. We often find ourselves debating the broader roles of LLMs within the ABM process, as well as their wider implications for scientific research that is built upon established standards and norms. These threads are not separate, but tightly interwoven, and our discussions tend to move fluidly between them.

Along the way, I have been taking notes. With the agreement of the group, I will share some of these reflections as blog posts here. They capture many of the engaging and thought-provoking ideas that would otherwise remain tucked away in my notebook.

If you would like to join the discussions, please get in touch, and I will add you to the group.
 
 
Back to Top
 
 
Tiya's Student Internship: The Use of LLMs for Social Simulation Development
 
Credits: Written by Peer-Olaf Siebers. Research conducted by Tiya Teshome (University of Leicester)
 
Introduction
In this blog post, I would like to share insights from a recent undergraduate internship project that explored the intersection of Large Language Models (LLMs) and Agent-Based Social Simulation (ABSS). ABSS has emerged as a powerful methodology for modelling complex systems, yet the manual design process remains a significant barrier to accessibility.

This project investigated how LLMs can automate and streamline social simulation development, addressing three key research questions through systematic investigation:
  1. How does output quality compare across different LLMs?
  2. What are the relative merits of NetLogo versus Python implementation approaches?
  3. How can web applications facilitate ABSS development?
Below are the results of these investigations.

1. Comparing LLM Output Quality
The first task evaluated four leading LLMs (GPT-4, Claude, DeepSeek, and Gemini) across two distinct prompt types: (1) general use cases and (2) ABSS-specific use cases. Each LLM was assessed for precision, accuracy, and simulation modelling suitability. Results revealed varying strengths: whilst all models generated structured agent-based designs, consistency differed significantly. GPT-4 and Claude demonstrated superior architectural understanding, whilst Gemini excelled at generating visual components for enhanced model realism. Prompt engineering proved crucial, with iterative refinement necessary to achieve consistent, structured outputs suitable for implementation.

2. NetLogo vs Python Implementation Quality
The second investigation compared LLM-assisted implementation across NetLogo (an ABSS IDE) and Python, using an epidemic SIR model and a futuristic museum model. The NetLogo implementations proved more successful, with LLMs effectively debugging turtle logic and variable usage through targeted feedback. Python development presented greater challenges, with frequent calls to non-existent methods requiring manual corrections. Whilst LLMs provided valuable architectural guidance, Python's complexity demanded more human intervention than NetLogo's simplified agent-based environment.

3. Streamlit Web Application Development
The final component produced a functional prototype using LLaMa 3.3 and Streamlit, enabling non-specialists to convert concepts into structured simulation designs. The application guides users through five modular stages: agent roles, behaviours, environment layout, interaction rules, and simulation measures. Key features include component editing capabilities and structured output generation, successfully democratising access to social simulation design whilst maintaining scientific rigour.

Conclusions
Overall, this project has been a big success thanks to the hard work of Tiya. The research demonstrates LLMs' potential to enhance social simulation accessibility, though human oversight remains essential for ensuring accuracy and implementation success.

Acknowledgement: This internship was sponsored by the Royal Academy of Engineering in collaboration with Google DeepMind Research Ready and the Hg Foundation.
 
 
Back to Top
 
 
Publication: Large Language Models for Agent-Based Modelling
 
Credits: Written by Peer-Olaf Siebers
 
The paper titled Large Language Models for Agent-Based Modelling: Current and Possible Uses Across the Modelling Cycle, authored by the LLM4ABM Gang (Loïs Vanhée, Melania Borit, Peer-Olaf Siebers, Roger Cremades, Christopher Frantz, Önder Gürcan, František Kalvas, Denisa Reshef Kera, Vivek Nallur, Kavin Narasimhan, and Martin Neumann) has been accepted for presentation at the Social Simulation Conference 2025 (SSC2025).
 
Abstract: The emergence of Large Language Models (LLMs) with increasingly sophisticated natural language understanding and generative capabilities has sparked interest in the Agent-based Modelling (ABM) community. With their ability to summarize, generate, analyze, categorize, transcribe and translate text, answer questions, propose explanations, sustain dialogue, extract information from unstructured text, and perform logical reasoning and problem-solving tasks, LLMs have a good potential to contribute to the modelling process. After reviewing the current use of LLMs in ABM, this study reflects on the opportunities and challenges of the potential use of LLMs in ABM. It does so by following the modelling cycle, from problem formulation to documentation and communication of model results, and holding a critical stance.
 
 
Back to Top
 
 
Publication: Using an AI-powered Buddy for Designing Innovative ABMs
 
Credits: Written by Peer-Olaf Siebers
 
After working on it for more than a year, my paper Exploring Conversational AI for Agent-Based Social Simulation Design has finally been published in the Journal of Artificial Societies and Social Simulation. It explores the use of ChatGPT for conceptual modelling and the co-creation of agent-based models. To promote the paper, I gave a presentation at the LLM4ABM Special Interest Group meeting yesterday. Below, you can find links to the presentation slides, the published paper, and a GitHub Repository containing additional resources. The repository is a dynamic resource and over the summer I will add further examples, an updated script, and other resources. You are welcome to add your examples to the repository as well :-).
 
Abstract: ChatGPT, the AI-powered chatbot with a massive user base of hundreds of millions, has become a global phenomenon. However, the use of Conversational AI Systems (CAISs) like ChatGPT for research in the field of Social Simulation is still limited. Specifically, there is no evidence of its usage in Agent-Based Social Simulation (ABSS) model design. This paper takes a crucial first step toward exploring the untapped potential of this emerging technology in the context of ABSS model design. The research presented here demonstrates how CAISs can facilitate the development of innovative conceptual ABSS models in a concise timeframe and with minimal required upfront case-based knowledge. By employing advanced prompt engineering techniques and adhering to the Engineering ABSS framework, we have constructed a comprehensive prompt script that enables the design of conceptual ABSS models with or by the CAIS. A proof-of-concept application of the prompt script, used to generate the conceptual ABSS model for a case study on the impact of adaptive architecture in a museum environment, illustrates the practicality of the approach. Despite occasional inaccuracies and conversational divergence, the CAIS proved to be a valuable companion for ABSS modellers.
 
 
Back to Top
 
 
EABSS-2: A Software Engineer's Approach to Creating Agent-Based Models
 
Credits: Drafted by Peer-Olaf Siebers, turbocharged by ChatGPT-4o
 
Ever wondered how to build agent-based models the smart way, without the usual headaches? If you are a fan of modelling human behaviour, testing policy impacts, or just love crafting digital societies, then you're going to love what's new in the world of simulation frameworks. Meet EABSS-2, the fresh and improved version of the Engineering Agent-Based Social Simulations (EABSS) framework! We all know that designing agent-based models can be a complex (sometimes messy) process, especially when working in teams. That's where EABSS-2 steps in to save the day. It is more than just a tool; it's a guided workflow that helps both solo and collaborative creators turn great ideas into working simulations with far less friction.

Although EABSS-2 is still a work in progress, a preview and supplementary material are already available for those keen to take a first look. The framework's new features are introduced in an upcoming journal paper, which offers a detailed walkthrough of the improvements and a case study showcasing them in action. The official release is planned for December 2025, so stay tuned for more updates.
 
  • EABSS Toolkit Website: link
  • EABSS-2 Survival Pack: link
So, what's new and exciting in EABSS-2?
  • Clarity from the Start: The biggest new feature? A brand-new first step: Define Problem Statement. It might sound simple, but it's a game-changer. By starting with a clear articulation of what you're studying and why, you massively reduce confusion later down the line. No more vague goals or misaligned assumptions!
  • Smarter Reuse, Less Rework: The second major upgrade is an instructional information reuse schema (we know, sounds fancy – but it's super helpful). This guides you to pull forward useful insights from earlier steps, so you're not constantly reinventing the wheel. Plus, it checks for any missing info in real time, like a helpful co-pilot reminding you to dot your i's and cross your t's.
EABSS-2 isn't just a framework, it's a thoughtful redesign focused on making simulation-building more accessible, rigorous, and even enjoyable. Whether you're a student, a researcher, or just simulation-curious, this update could make your next social modelling project your smoothest yet. Let's build smarter societies, one simulation at a time!
 
 
Back to Top
 
 
From Roots to Horizons: The Evolution of My ABM Research Journey
 
Credits: Words by Peer-Olaf Siebers. Title crafted by ChatGPT-4o
 
My research related to Agent-Based Modelling (ABM) falls under the broader theme of Collaboratively Creating Artificial Labs for Better Understanding Current and Future Human and Mixed Human/Robot Societies. I am a strong advocate for Computational ABM. Initially, my focus was on applying Computational ABM across a wide range of domains (poster 1 from 2012). Subsequently, I concentrated on integrating software engineering methods and techniques to develop conceptual agent-based models (see poster 2 from 2023). My current research explores how large language models (LLMs) can be used at various stages of the ABM study lifecycle (see poster 3 from 2024).

For more information, please consult the posters:
  • Poster 1: Everything you always wanted to know about Dr Siebers (academic details only :)
  • Poster 2: Everything you always wanted to know about Dr Siebers (academic details only :) - 2e
  • Poster 3: Exploring the Role of Generative AI in Advancing Agent-Based Model Design and Implementation
 
 
Back to Top
 