The Impact of Large Language Models (LLMs) on Software Development
Unit 7
Introduction
Artificial Intelligence (AI) is increasingly reshaping software development, enabling programmers to write code more efficiently and effectively by automating tasks and assisting in complex problem-solving (Stray et al., 2025; Treude and Gerosa, 2025; Tabarsi et al., 2025). One of the most significant advancements in this domain is the emergence of large language models (LLMs). Based on the transformer architecture introduced by Vaswani et al. (2017), these models have been trained on massive volumes of code and natural language to generate, complete, and interpret complex code snippets (Chen et al., 2021; Rasnayaka et al., 2024). Examples of LLM-based tools used in software engineering include GitHub Copilot, ChatGPT, and CodeBERT (Chen et al., 2021; Chatterjee et al., 2024; Negri-Ribalta et al., 2024). These tools have demonstrated remarkable capabilities in tasks such as code generation, bug detection, documentation, and test case creation (Nettur et al., 2025; Yetistiren et al., 2023; Tang et al., 2023).
The introduction of LLMs in software development has sparked a wide range of reactions from the developer community. On one hand, there is excitement about the potential of these tools to increase productivity, enhance code quality, accelerate the onboarding of new developers, and automate routine tasks (Noy and Zhang, 2023; Chatterjee et al., 2024; Nettur et al., 2025). On the other hand, concerns have been raised about the generation of insecure or inaccurate code, the erosion of traditional coding skills, and the ethical and legal implications of using AI-generated code (Negri-Ribalta et al., 2024; Tadi, 2025; Stray et al., 2025).
This literature review examines the impact of LLMs on software development, focusing specifically on their roles in code generation, debugging, and real-time developer assistance. It deliberately covers LLM-based tools and excludes general natural language processing applications as well as older, non-LLM-based AI coding tools. The review is structured thematically, beginning with an analysis of productivity outcomes, followed by an exploration of security and reliability concerns, and concluding with a discussion of human-AI collaboration and integration best practices.
Capabilities and evolution of LLMs in software development
LLMs are a recent advance in artificial intelligence that has revolutionized software development, particularly in code generation and understanding (Rasnayaka et al., 2024). These models are built upon the transformer architecture and the concept of “attention” introduced by Vaswani et al. (2017), which enables them to handle long-range dependencies and complex structures in both natural and programming languages (Negri-Ribalta et al., 2024).
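To make the attention mechanism concrete, the following is a minimal NumPy sketch of the scaled dot-product attention defined by Vaswani et al. (2017). The query, key, and value matrices here are random placeholders rather than learned projections from a real model.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # pairwise token-to-token relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # weighted mix of value vectors

# Toy example: a sequence of 4 tokens, each embedded in 8 dimensions.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```

Because every token attends to every other token in a single matrix operation, the model can relate a variable's use to its definition many lines away, which is what makes the architecture suited to code.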
These models are trained on vast datasets of natural language text and code, allowing them to learn patterns and relationships across distinct domains. OpenAI’s Codex, for instance, was fine-tuned on code from 54 million public GitHub repositories, enabling it to generate and complete code based on prompts written in plain English (Tang et al., 2023; Nettur et al., 2025). At a high level, training involves breaking the input into tokens and teaching the model to predict the next token in a sequence, so that it gradually learns abstract patterns and representations in a latent semantic space (Negri-Ribalta et al., 2024).
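As a toy illustration of this next-token objective, the sketch below builds a simple bigram frequency model over a tiny tokenized "corpus". Real LLMs learn vastly richer neural representations, but the prediction target, namely the most likely continuation of a token sequence, is the same.

```python
from collections import Counter, defaultdict

# A tiny tokenized corpus, standing in for the billions of tokens
# used to train real LLMs.
corpus = "def add ( a , b ) : return a + b".split()

# Count how often each token follows each preceding token (a bigram model).
successors = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    successors[prev][nxt] += 1

def predict_next(token):
    """Return the most frequent next token seen after `token`."""
    counts = successors.get(token)
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("return"))  # -> 'a'
```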
Their capabilities now apply to a wide range of programming tasks. LLMs assist with auto-completion, code generation, bug detection, documentation, testing, and project management (Treude and Gerosa, 2025; Rasnayaka et al., 2024). They offer real-time suggestions, generate entire frameworks, and help with summarizing or refactoring code (Nettur et al., 2025). This makes them valuable not only for novice programmers, but also for experienced developers handling repetitive or low-level tasks such as unit tests and database queries (Tabarsi et al., 2025).
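As an illustration of how such assistance is typically accessed programmatically, the sketch below asks a chat-based LLM to draft a unit test using OpenAI's Python client. The model name, prompt, and target function are illustrative, and a valid API key is assumed to be set in the environment.

```python
from openai import OpenAI

# Assumes OPENAI_API_KEY is set in the environment; model name is illustrative.
client = OpenAI()

prompt = (
    "Write a pytest unit test for this function:\n"
    "def slugify(title: str) -> str:\n"
    "    return title.strip().lower().replace(' ', '-')"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)

# The suggestion still requires human review before entering the codebase.
print(response.choices[0].message.content)
```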
The evolution of LLM-based tools has been rapid. Early solutions focused on template-based generation and boilerplate code but have advanced to models capable of writing complex, context-aware functions (Nettur et al., 2025). Leading models include OpenAI’s Codex and the ChatGPT series, which are based on GPT-3.5, GPT-4 and the multimodal GPT-4o, along with CodeParrot, StarCoder, CodeBERT and CodeT5 (Negri-Ribalta et al., 2024). Commercial tools like GitHub Copilot and Amazon CodeWhisperer incorporate these models into real-world developer environments (Chatterjee et al., 2024).
Impact on developer roles at different experience levels
The impact of these assistants varies with a developer’s level of experience, particularly between junior and senior software engineers. These models influence not only how a code snippet is written but also how the developer approaches the task, structures the workflow, and interacts with the code (Nettur et al., 2025; Rasnayaka et al., 2024; Stray et al., 2025).
For junior developers, LLMs can act as collaborative instructors, helping them learn unfamiliar languages, frameworks, and debugging techniques through features such as code explanation and autocomplete (Nettur et al., 2025). They are especially helpful for understanding simple language syntax and repetitive logic, lowering the entry barrier to programming (Negri-Ribalta et al., 2024). Copilot, for example, has been shown to boost productivity in novice users by over 50%, though these gains often come with increased time spent reviewing and validating AI suggestions (Chatterjee et al., 2024; Stray et al., 2025). However, over-reliance is a concern: less experienced developers may struggle to validate the correctness of generated code and may adopt suboptimal or incorrect solutions without realizing it (Stray et al., 2025; Rasnayaka et al., 2024).
In contrast, senior developers tend to use LLMs more carefully and systematically. Their command of best practices and architecture allows them to quickly discern valuable outputs and discard irrelevant or flawed suggestions (Stray et al., 2025). For experienced programmers, these models are tools for automating repetitive tasks, such as writing tests and migrating or refactoring legacy code, freeing time for higher-level design and decision-making (Nettur et al., 2025; Tadi, 2025). This points to an emerging role for experienced developers as “orchestrators of intent”, in which they set development goals and iteratively prompt the model to align with project needs (Tadi, 2025).
Across both groups, the development of software and its lifecycle is evolving. Developers now spend less time manually writing code and more time evaluating model outputs. This shift also reflects a change in the weighting of required skills. Rather than prioritizing deep expertise in a specific programming language, such as Python, there is growing value in developers who can effectively communicate with LLMs through prompt engineering. In some cases, a less experienced programmer with strong prompting skills may contribute more efficiently than a traditional developer unfamiliar with AI-assisted workflows (Stray et al., 2025; Nettur et al., 2025; Rasnayaka et al., 2024).
Risks, security and ethical implications
Despite the growing utility of LLMs in software development, their integration raises substantial security concerns. A fundamental drawback is the frequent generation of insecure code, often containing vulnerabilities listed in the Common Weakness Enumeration (CWE) Top 25. For example, studies evaluating GitHub Copilot found that approximately 40 to 44% of its code suggestions contained security flaws, including hardcoded credentials, insufficient input validation, and poor error handling (Negri-Ribalta et al., 2024; Nettur et al., 2025). Notably, these are not necessarily intricate errors; many are what developers call “stupid bugs”: simple but critical one-line mistakes that can take longer to debug after being generated automatically (Jesse et al., 2023, as cited in Negri-Ribalta et al., 2024).
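To make these categories of flaw concrete, the following is a constructed Python example, not taken from the cited studies, showing a Copilot-style suggestion with a hardcoded credential (CWE-798) and an injectable SQL query (CWE-89), followed by a safer equivalent.

```python
import os
import sqlite3

# Insecure pattern, typical of the flaws reported in Copilot studies:
API_KEY = "sk-live-1234567890abcdef"  # hardcoded credential (CWE-798)

def find_user_insecure(conn, username):
    # Building SQL by string concatenation allows injection (CWE-89).
    return conn.execute(
        "SELECT * FROM users WHERE name = '" + username + "'"
    ).fetchall()

# Safer equivalent: secret read from the environment, query parameterized.
API_KEY_SAFE = os.environ.get("API_KEY")

def find_user_safe(conn, username):
    return conn.execute(
        "SELECT * FROM users WHERE name = ?", (username,)
    ).fetchall()

# Minimal usage with an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice')")
print(find_user_safe(conn, "alice"))  # [('alice',)]
```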
The severity of these issues varies by language. In C, over half of Copilot-generated programs included top-ranked CWE vulnerabilities; Python, while generally safer, still saw over 38% of outputs containing major security issues (Negri-Ribalta et al., 2024). These figures suggest that LLMs, though capable, may still lack a nuanced understanding of secure coding practices regardless of the language.
The source of this insecurity often lies in the training data, which largely consists of open-source repositories containing outdated or vulnerable code; this leads to the unintended replication of bad practices and old faults (Negri-Ribalta et al., 2024; Nettur et al., 2025). In addition, LLMs are non-deterministic: identical prompts do not reliably produce the same output, further complicating the standardization of secure results. Prompt quality and surrounding context also heavily influence output quality, with ambiguous or disorganized prompts increasing the probability of insecure suggestions (Stray et al., 2025; Tabarsi et al., 2025).
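This non-determinism stems from sampling: at each step the model draws the next token from a probability distribution, usually reshaped by a "temperature" parameter. The sketch below uses made-up token scores to show how the same prompt (the same distribution) can yield different tokens on repeated calls.

```python
import numpy as np

def sample_token(logits, temperature=1.0, rng=None):
    """Sample one token index from softmax(logits / temperature)."""
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

# Made-up scores for four candidate next tokens.
logits = [2.0, 1.5, 0.5, 0.1]

# The same "prompt" (same logits) can yield different tokens on each call,
# which is why identical prompts do not reliably produce identical code.
print([sample_token(logits, temperature=0.8) for _ in range(5)])
```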
Taken together, these risks underscore the importance of strong validation methods, secure-by-design development practices, and continued research into AI reliability. LLMs are a powerful tool in software development and have been improving rapidly. As their capabilities grow, it is equally important to scale up efforts to address the associated security risks and to adopt safeguards throughout the software development lifecycle.
Human-AI collaboration and integration best practices
These assistants have introduced a new form of human-AI collaboration in software development. Programming is evolving into a dynamic working relationship in which developers declare intent and LLMs serve as explorers or explainers (Tadi, 2025).
Developers now spend more time reviewing AI-generated suggestions, particularly during early project phases where LLMs assist with boilerplate code, syntax help, and simple algorithms (Rasnayaka et al., 2024; Stray et al., 2025). However, as mentioned earlier, for complex or domain-specific problems, human expertise is irreplaceable (Stray et al., 2025).
Strong prompt engineering is fundamental to the efficient integration of these tools. Well-crafted, clear prompts improve code quality and relevance (Tadi, 2025; Stray et al., 2025). Developers often use iterative, dialogue-based prompting to better align outputs with their intent and to uncover hidden requirements (Tadi, 2025).
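A minimal sketch of this iterative, dialogue-based pattern is shown below. Here, ask_llm is a hypothetical stand-in for any chat-model call, and the review step is deliberately left abstract; it could be a human code review or an automated check.

```python
def ask_llm(messages):
    """Hypothetical stand-in for a chat-model API call; returns generated code."""
    raise NotImplementedError("wire this to an LLM client of your choice")

def refine_until_acceptable(task, passes_review, max_rounds=3):
    """Iteratively refine a prompt based on reviewer feedback."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_rounds):
        code = ask_llm(messages)
        ok, feedback = passes_review(code)  # human or automated review step
        if ok:
            return code
        # Feed the critique back into the dialogue and try again.
        messages.append({"role": "assistant", "content": code})
        messages.append({"role": "user", "content": f"Revise: {feedback}"})
    return None  # escalate to a human developer after max_rounds attempts
```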
Integrated Development Environments (IDEs), such as VS Code and Cursor, are adapting to support this shift. Future IDEs may include features like prompt scaffolding, context visualization, and feedback loops to enhance collaboration, and tools that surface the reliability or provenance of generated code can help developers verify and trust AI suggestions (Tadi, 2025; Nettur et al., 2025).
Despite AI’s benefits, rigorous human oversight is essential. Developers must critically evaluate LLM-generated code through reviews and testing to avoid adopting insecure or suboptimal solutions (Negri-Ribalta et al., 2024; Nettur et al., 2025; Tabarsi et al., 2025).
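One lightweight form of such oversight is to gate every AI-generated snippet behind a small test suite before it is merged. The sketch below checks a candidate function against known input/output pairs; the function body and test cases are illustrative.

```python
# Illustrative oversight gate: run AI-generated code against known cases
# before accepting it into the codebase.
def generated_slugify(title):
    # Pretend this body came from an LLM suggestion.
    return title.strip().lower().replace(" ", "-")

TEST_CASES = [
    ("Hello World", "hello-world"),
    ("  Trim Me  ", "trim-me"),
]

def review_gate(candidate, cases):
    """Return True only if the candidate passes every known case."""
    return all(candidate(inp) == expected for inp, expected in cases)

assert review_gate(generated_slugify, TEST_CASES), "Reject the suggestion"
print("Suggestion passed the review gate")
```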
Conclusions
LLMs have rapidly become influential tools in software development, offering significant benefits in productivity, learning support, and automation. However, they also introduce serious concerns around code quality, security, and developer dependency. Their impact varies by experience level, with junior and senior developers using these tools in different ways. As LLMs continue to evolve, effective collaboration between human developers and AI will require strong prompt engineering, secure coding practices, and critical oversight. By understanding both the strengths and limitations of these models, developers can make better use of them in safe and efficient ways.
References
Chatterjee, S., Liu, C.L., Rowland, G. and Hogarth, T. (2024) The impact of AI tool on engineering at ANZ Bank: An empirical study on GitHub Copilot within corporate environment. arXiv preprint arXiv:2402.05636.
Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H.P., Kaplan, J. et al. (2021) Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374. Available at: https://doi.org/10.48550/arXiv.2107.03374.
Negri-Ribalta, C., Geraud-Stewart, R., Sergeeva, A. and Lenzini, G. (2024) ‘A systematic literature review on the impact of AI models on the security of code generation’, Frontiers in Big Data, 7, p.1386720. Available at: https://doi.org/10.3389/fdata.2024.1386720.
Nettur, S.B., Karpurapu, S., Nettur, U., Gajja, L.S., Myneni, S. and Dusi, A. (2025) The role of GitHub Copilot on software development: A perspective on productivity, security, best practices and future directions. arXiv preprint arXiv:2502.13199.
Noy, S. and Zhang, W. (2023) ‘Experimental evidence on the productivity effects of generative artificial intelligence’, Science, 381(6608), pp.851–857. Available at: https://doi.org/10.1126/science.adh2586.
Rasnayaka, S., Wang, G., Shariffdeen, R. and Iyer, G.N. (2024) ‘An empirical study on usage and perceptions of LLMs in a software engineering project’, Proceedings of the 1st International Workshop on Large Language Models for Code. Lisbon, April 2024, pp.111–118.
Stray, V., Moe, N.B., Ganeshan, N. and Kobbenes, S. (2025) ‘Generative AI and developer workflows: How GitHub Copilot and ChatGPT influence solo and pair programming’, Proceedings of the 58th Hawaii International Conference on System Sciences.
Tabarsi, B., Reichert, H., Limke, A., Kuttal, S. and Barnes, T. (2025) LLMs’ reshaping of people, processes, products, and society in software development: A comprehensive exploration with early adopters. arXiv preprint arXiv:2503.05012.
Tadi, S.R.C.C.T. (2025) ‘Developer and LLM pair programming: An empirical study of role dynamics and prompt-based collaboration’, International Journal of Advanced Research in Science, Communication and Technology, 5(3), p.436. Available at: https://doi.org/10.48175/IJARSCT-26358.
Tang, N., Chen, M., Ning, Z., Bansal, A., Huang, Y., McMillan, C. and Li, T. (2023) ‘An empirical study of developer behaviors for validating and repairing AI-generated code’, 13th Workshop on the Intersection of HCI and PL. Boston, January 2023.
Treude, C. and Gerosa, M.A. (2025) How developers interact with AI: A taxonomy of human-AI collaboration in software engineering. arXiv preprint arXiv:2501.08774.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N. et al. (2017) ‘Attention is all you need’, Advances in Neural Information Processing Systems, 30, pp.5998–6008.
Yetistiren, B., Özsoy, I., Ayerdem, M. and Tüzün, E. (2023) Evaluating the code quality of AI-assisted code generation tools: An empirical study on GitHub Copilot, Amazon CodeWhisperer, and ChatGPT. arXiv preprint arXiv:2304.10778.