I've compiled all my recent tests with 11 different LLMs into one summary table. The results are based on 3 different Laravel projects that I published on my YouTube channel over the last few weeks.
Each prompt was launched 5 times on the same project, so with 3 projects the maximum is 15 points.
Scoring is all-or-nothing: if all evaluation tests passed, the LLM got 1 point for the task. If at least one test failed, the LLM got 0 points for that task.
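To make the scoring rule concrete, here is a minimal sketch in Python. It assumes each of the 5 launches per project is scored separately, and all names (`run_score`, `total_score`, the project keys) and the sample pass/fail data are made up for illustration, not taken from my actual test runs.

```python
from typing import Dict, List

PROJECTS = 3
RUNS_PER_PROJECT = 5
MAX_SCORE = PROJECTS * RUNS_PER_PROJECT  # 15 points max

def run_score(test_results: List[bool]) -> int:
    """Return 1 only if every evaluation test passed, otherwise 0.

    Assumption: a launch with no test results at all counts as 0.
    """
    return 1 if test_results and all(test_results) else 0

def total_score(runs_by_project: Dict[str, List[List[bool]]]) -> int:
    """Sum the all-or-nothing points over every launch of every project."""
    return sum(
        run_score(run)
        for runs in runs_by_project.values()
        for run in runs
    )

# Hypothetical pass/fail data for one LLM (illustration only).
example = {
    "project_a": [[True, True]] * 5,                     # 5 clean launches
    "project_b": [[True, False]] + [[True, True]] * 4,   # 1 launch with a failed test
    "project_c": [[False]] * 5,                          # every launch failed
}

print(f"{total_score(example)}/{MAX_SCORE}")  # prints 9/15
```

Note that a launch with 9 of 10 tests passing scores the same as one with 0 of 10: there is no partial credit.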
Here is the summary table.

Premium members see the full table with these LLMs tested, in alphabetical order:
Deepseek-V4-Pro / GLM-5.1 / Kimi K2.6 / MiMo 2.5 Pro / Minimax M2.7 / Qwen 3.6 Plus / Sonnet 4.6
I also tested Grok 4.3, but it performed so badly that I decided NOT to include it in this leaderboard.
I will keep testing models: I'll come up with new evaluation tasks and update the table when new LLMs are released.