🌑

Jian Fang

Specification- and Proof-Driven Development with LLM

Although program verification methods can provide logical correctness guarantees for software, they have long been regarded as a high-cost development approach. The main reasons are as follows:

  • High learning cost. Program verification requires developers to possess knowledge of program verification. But for a developer who has been working in the field for many years, asking them to learn mathematical logic that seems completely unrelated to their current work, or a new verification language (such as Rocq), is nothing short of an unreasonable demand.
  • High maintenance cost. Real-world program development scenarios constantly face all kinds of requirement changes. For the development processes of most companies, a requirement change means changing the original implementation code. After the new requirement is implemented, it is then handed over to testing for correctness verification. But if a project is proof-driven, changing the original code means that all previous proofs have to be redone, which is an enormous engineering overhead.
  • The amount of code written for verification far exceeds the amount of code for functional implementation. Referring to Sel4, the amount of code used for proof is more than ten times that of the functional implementation code. Paying dozens of times the cost for the sake of security is, in most scenarios, low in “cost-effectiveness”.

Although program verification methods face various difficulties in practical adoption, since the beginning of this year, with the release of Claude Opus 4.6, as well as the repeated functional iterations of LLM agents such as Claude Code and Cursor, these difficulties in adoption may now have a possibility of being solved. I experimented with the effectiveness of Cursor + Opus 4.6/Opus 4.7/GPT-5.5 in Rocq proofs, and the results were all quite stunning. In the past, such a proof would have required a person with Rocq experience to work for several days to complete. But now, with the help of LLMs, it only takes a few hours. From this point of view, it seems that the two problems of high maintenance cost and proof workload in program verification can be reasonably well solved.

As for the learning cost, this is a point that is difficult to assess. Current LLMs have indeed lowered the threshold for programming. With the help of LLMs, even a person who has never learned programming can obtain an executable program that completes the task. But as functionality iterates, task complexity rises, and there are more requirements for security/runtime efficiency, the project still requires developers with relevant professional knowledge to develop and maintain it. The larger and more complex the project, the more it requires experienced professionals to manage and maintain it. But it is undeniable that LLMs have indeed improved development efficiency and lowered the threshold for developers’ understanding of knowledge. As for program verification, because LLMs can solve part of the difficult problems in understanding program logic, developers do not need to become experts in the field of formal verification; knowing some basic logical foundations and a formal verification language is enough to get started writing verified code.

Based on this, we attempted to verify the possibility of using LLMs for Specification- and Proof-Driven Development with LLM (SPDDwL). The idea is that, given natural language requirements as input, the LLM writes all parts of the project that do not involve side effects in Rocq, then directly proves the logical properties that this part of the code needs to satisfy in Rocq, and finally exports all the proven code into the programming language implementation needed by the project, such as C++. The detailed process can be found in our paper: https://arxiv.org/pdf/2605.26017 . Through preliminary experiments, within 30 minutes, we automatically generated about 1700 lines of Rocq functional implementation and property proofs, while exporting 2k+ lines of C++ code. The exported code passed the sample tests as well as 12 hours of fuzz testing.

The current results offer us a possibility for program development: by writing in interactive theorem proving languages such as Rocq, and then exporting to the programming language the project actually uses, the development work can be completed. This specification- and proof-driven development process greatly guarantees the security and reliability of the code. With the help of LLMs, it also greatly reduces the cost of writing code and proofs. At the same time, this process brings another advantage: although LLMs have hallucinations and may generate code that looks correct but is actually wrong. Even if an LLM generates an incorrect Rocq implementation, the incorrect implementation cannot pass the logical proof during the proving stage, so the hallucination will not propagate into the exported code, thereby avoiding the model’s hallucinations.

From the current results, program verification methods now have a way that could potentially be used by more people and could be put into practice in projects. At present, this work is still in its infancy. Many problems remain to be solved, such as how to combine real requirements with formalization for the various requirements of real-world scenarios, and how to better combine LLMs with interactive theorem provers such as Rocq.

— May 29, 2026