Last week, I got an email from Anthropic announcing that Claude 3.5 Sonnet was available. According to the AI company, “Claude 3.5 Sonnet raises the industry bar for intelligence, outperforming competitor models and Claude 3 Opus on a wide range of evaluations.”
The company added: “Claude 3.5 Sonnet is ideal for complex tasks like code generation.” I decided to see if that was true.
I’ll subject the new Claude 3.5 Sonnet model to my standard set of coding tests, tests I’ve run against a wide range of AIs with a wide range of results. Want to follow along with your own tests? Point your browser to How I test an AI chatbot’s coding ability – and you can too, which contains all the standard tests I apply, explanations of how they work, and what to look for in the results.
OK, let’s dig into the results of each test and see how they compare to previous tests using Microsoft Copilot, Meta AI, Meta Code Llama, Google Gemini Advanced, and ChatGPT.
1. Writing a WordPress plugin
At first, this seemed to have so much promise. Let’s start with the user interface Claude 3.5 Sonnet created based on my test prompt.
This is the first time an AI has decided to put the two data fields side-by-side. The layout is clean and looks great.
Claude also decided to do something else I’ve never seen an AI do. This plugin can be created using just PHP code, which is the code running on the back end of a WordPress server.
But some AI implementations have also added JavaScript code (which runs in the browser to control dynamic user interface features) and CSS code (which controls how the browser displays information).
In a PHP environment, if you need PHP, JavaScript, and CSS, you can either include the CSS and JavaScript right in the PHP code (that’s a feature of PHP), or you can put the code in three separate files: one for PHP, one for JavaScript, and one for CSS.
Usually, when an AI wants to use all three languages, it shows what needs to be cut and pasted into the PHP file, then another block to be cut and pasted into a JavaScript file, and then a third block to be cut and pasted into a CSS file.
But Claude just provided one PHP file which, when it ran, auto-generated the JavaScript and CSS files into the plugin’s home directory. This is both fairly impressive and somewhat wrong-headed. It’s cool that it tried to make the plugin creation process easier, but whether or not a plugin can write to its own folder depends on how the server’s OS is configured (file ownership and permissions), and there’s a very high chance it would fail.
I allowed it in my testing environment, but I would never allow a plugin to rewrite its own code in a production environment. That’s a very serious security flaw.
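To make the failure mode concrete, here is a minimal sketch, written in Python rather than PHP purely for illustration, of the "write my own asset files at run time" pattern with a guard for an unwritable directory. The function name write_assets_if_possible and the file names are hypothetical; this is not the code Claude generated.

```python
import os
from pathlib import Path


def write_assets_if_possible(plugin_dir, assets):
    """Write each asset (file name -> text) into plugin_dir.

    Returns the list of file names actually written. If the directory is
    missing or not writable by the current process, nothing is written and
    an empty list is returned, so the caller can fall back to inline
    CSS/JavaScript instead.
    """
    target = Path(plugin_dir)
    # os.access is only a hint: permissions can still change between the
    # check and the write, so the write itself is also guarded below.
    if not target.is_dir() or not os.access(target, os.W_OK):
        return []

    written = []
    for name, content in assets.items():
        try:
            (target / name).write_text(content, encoding="utf-8")
        except OSError:
            # Stop at the first failed write (read-only mount, quota, etc.).
            break
        written.append(name)
    return written
```

On a locked-down production server the guard would typically return an empty list, which is the situation described above: the generated files never appear, and anything depending on them silently breaks.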
Despite the fairly creative nature of Claude’s code generation solution, the bottom line is that the plugin failed. Pressing the Randomize button does absolutely nothing. That’s sad because, as I said, it had so much promise.
Here are the aggregate results of this and previous tests:
- Claude 3.5 Sonnet: Interface: good, functionality: fail
- ChatGPT GPT-4o: Interface: good, functionality: good
- Microsoft Copilot: Interface: adequate, functionality: fail
- Meta AI: Interface: adequate, functionality: fail
- Meta Code Llama: Complete failure
- Google Gemini Advanced: Interface: good, functionality: fail
- ChatGPT 4: Interface: good, functionality: good
- ChatGPT 3.5: Interface: good, functionality: good
2. Rewriting a string function
This test is designed to evaluate how the AI does rewriting code to work more appropriately for the given need; in this case, dollars and cents conversions.
The Claude 3.5 Sonnet revision properly removed leading zeros, making sure that entries like “000123” are treated as “123”. It properly allows integers and decimals with up to two decimal places (which is the key fix the prompt asked for). It prevents negative values. And it’s smart enough to return “0” for any weird or unexpected input, which prevents the code from abnormally ending in an error.
One failure is that it won’t allow decimal values alone to be entered. So if the user entered 50 cents as “.50” instead of “0.50”, it would fail the entry. Based on how the original text description for the test is written, it should have allowed this input form.
Although most of the revised code worked, I have to count this as a fail because if the code were pasted into a production project, users would not be able to enter inputs that contained only values for cents.
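For readers who want to see the required behavior spelled out, here is a minimal Python sketch of a dollars-and-cents normalizer that meets every point above, including cents-only input such as “.50”. It is neither the test’s original function nor Claude’s revision; the name normalize_amount and the regular expression are assumptions made for this example.

```python
import re

# Optional whole-dollar digits, then an optional decimal point followed by
# one or two cent digits. No sign is allowed, so negative values never match.
_AMOUNT = re.compile(r"([0-9]+)?(?:\.([0-9]{1,2}))?")


def normalize_amount(text):
    """Return a cleaned-up amount string, or "0" for any invalid input."""
    if not isinstance(text, str):
        return "0"
    match = _AMOUNT.fullmatch(text.strip())
    # The pattern also matches the empty string, so require at least one part.
    if match is None or (match.group(1) is None and match.group(2) is None):
        return "0"
    # Strip leading zeros; cents-only input gets an explicit "0" for dollars.
    dollars = (match.group(1) or "0").lstrip("0") or "0"
    cents = match.group(2)
    return dollars if cents is None else f"{dollars}.{cents}"


print(normalize_amount("000123"))  # 123
print(normalize_amount(".50"))     # 0.50
print(normalize_amount("12.345"))  # 0 (more than two decimal places)
print(normalize_amount("-5"))      # 0 (negative values rejected)
```

Making the dollars group optional is what lets “.50” through. A pattern that requires at least one digit before the decimal point would reject it, which is the kind of failure described above.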
Here are the aggregate results of this and previous tests:
- Claude 3.5 Sonnet: Failed
- ChatGPT GPT-4o: Succeeded
- Microsoft Copilot: Failed
- Meta AI: Failed
- Meta Code Llama: Succeeded
- Google Gemini Advanced: Failed
- ChatGPT 4: Succeeded
- ChatGPT 3.5: Succeeded
3. Finding an annoying bug
The big challenge of this test is that the AI is tasked with finding a bug that’s not obvious and, to solve correctly, requires knowledge of the WordPress platform. It’s also a bug I did not immediately see on my own and, originally, asked ChatGPT to solve (which it did).
Claude not only got this right, catching the subtlety of the error and correcting it, but it was also the first AI since I published the full set of tests online to catch the fact that the publishing process introduced an error into the sample query (which I subsequently fixed and republished).
Here are the aggregate results of this and previous tests:
- Claude 3.5 Sonnet: Succeeded
- ChatGPT GPT-4o: Succeeded
- Microsoft Copilot: Failed. Spectacularly. Enthusiastically. Emojically.
- Meta AI: Succeeded
- Meta Code Llama: Failed
- Google Gemini Advanced: Failed
- ChatGPT 4: Succeeded
- ChatGPT 3.5: Succeeded
So far, we’re at two out of three fails. Let’s move on to our last test.
4. Writing a script
This test is designed to see how far the AI’s programming knowledge goes into specialized programming tools. While AppleScript is fairly common for scripting on Macs, Keyboard Maestro is a commercial application sold by a lone programmer in Australia. I find it indispensable, but it’s just one of many such apps on the Mac.
However, when testing in ChatGPT, ChatGPT knew how to “speak” Keyboard Maestro as well as AppleScript, which shows how broad its programming language knowledge is.
Unfortunately, Claude does not have that knowledge. It did write an AppleScript that attempted to speak to Chrome (that’s part of the test parameter) but it ignored the essential Keyboard Maestro component.
Worse, it generated code in AppleScript that would generate a runtime error. In an attempt to ignore case for the match in the test, Claude generated the line:
if theTab's title contains input ignoring case then
This is pretty much a double error because the “contains” statement is case insensitive and the phrase “ignoring case” does not belong where it was placed (in AppleScript, “ignoring case” opens a block that wraps a comparison; it cannot trail one). It caused the script to error out with an “Ignoring cannot go after this” syntax error message.
Here are the aggregate results of this and previous tests:
- Claude 3.5 Sonnet: Failed
- ChatGPT GPT-4o: Succeeded but with reservations
- Microsoft Copilot: Failed
- Meta AI: Failed
- Meta Code Llama: Failed
- Google Gemini Advanced: Succeeded
- ChatGPT 4: Succeeded
- ChatGPT 3.5: Failed
Overall results
Here are the overall results of the four tests: Claude 3.5 Sonnet failed three of them and passed only the bug-finding test.
I was somewhat bummed about Claude 3.5 Sonnet. The company specifically promised that this version was suited to programming. But as you can see, not so much. It’s not that it can’t program. It just can’t program correctly.
I keep looking for an AI that can best the ChatGPT solutions, especially as platform and programming environment vendors start to integrate these other models directly into the programming process. But, for now, I’m going back to ChatGPT when I need programming help, and that’s my advice to you as well.
Have you used an AI to help you program? Which one? How did it go? Let us know in the comments below.
You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter, and follow me on Twitter/X at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, and on YouTube at YouTube.com/DavidGewirtzTV.