Wednesday, October 29, 2025

Co-Pilot is my... Co-Pilot

Sorry, I just couldn't resist the opportunity to misappropriate the title of the book...

For the past few months, I've been playing with 3 AI tools: Co-Pilot, Grok, and Claude. My reactions are mixed. I've used Co-Pilot the most, then Grok, and just a little bit of Claude. I'm using the "free" versions of all three. I've used them in my web browser, Visual Studio, and VS Code. 

I find Co-Pilot the most "interactive" in the sense that it feels like a real conversation. Grok is a bit more to the point. And Claude has been too stingy w/ its "free" tokens, so I decided not to bother with it any longer. 

I've used Co-Pilot for more than just coding. I've used it for more "general knowledge" purposes, like "show me a graph/table of <put your statistics request here>". Grok & Claude have been strictly for coding questions.

Co-Pilot does very well with the "general knowledge" questions. All three do very well with very limited coding questions, such as "show me a method that does <your method requirements here>". But they all fail in a major fashion when you ask them to create a fully functioning app. 
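To make "very limited" concrete, here's the kind of small, self-contained request that all three handle well. The method below is my own illustration of such a prompt ("show me a method that turns a title into a URL-friendly slug"), not actual output from Co-Pilot, Grok, or Claude.

```csharp
// Hypothetical example of a narrow request: "show me a method that turns
// a title into a URL-friendly slug." My own illustration, not tool output.
using System.Linq;

public static class SlugHelper
{
    public static string ToSlug(string title)
    {
        if (string.IsNullOrWhiteSpace(title))
            return string.Empty;

        // Lower-case the title and replace anything that isn't a letter
        // or digit with a hyphen.
        var chars = title.Trim().ToLowerInvariant()
            .Select(c => char.IsLetterOrDigit(c) ? c : '-');
        var slug = new string(chars.ToArray());

        // Collapse runs of hyphens left behind by spaces and punctuation.
        while (slug.Contains("--"))
            slug = slug.Replace("--", "-");

        return slug.Trim('-');
    }
}
```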

For instance, I wanted to create a sample Blazor app. I wanted it to follow a "clean architecture" approach, using CQRS, with LiteBus as the mediator, since MediatR is no longer free for the newest version. So Co-Pilot tried to execute my instructions. The application looked fine, but it simply wouldn't build. Every time I'd submit an error to Co-Pilot, it would come up w/ a reasonable explanation for why it was occurring and suggest a fix. I'd apply the fix. More errors. More and more rounds of "here's the error" and "here's the fix". Still no joy. After about a day of struggling w/ this, the real reason for the failures became clear. Co-Pilot had installed the latest versions of the packages, but it was generating code from documentation for earlier versions. Some of the packages had breaking changes. When I pointed this out to Co-Pilot, it acknowledged the issue and tried a fix based on the current documentation, but things just didn't get any better. In fact, after a while, it simply fell back to using method calls from the previous versions. On top of that, even though it added the packages I asked for, the generated code that called into those packages didn't have the requisite "using" statements.
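For context, the shape I was asking Co-Pilot to generate looks roughly like the sketch below. The `ICommand`/`ICommandHandler<T>` interfaces here are my own simplified stand-ins for the mediator abstractions, NOT LiteBus's actual API; the point is that every handler file needs "using" directives for the package namespaces it touches, which is exactly what the generated code kept leaving out.

```csharp
// A minimal sketch of the CQRS shape I requested. These interfaces are
// simplified stand-ins for illustration, NOT the actual LiteBus types.
using System.Threading;
using System.Threading.Tasks;

public interface ICommand { }

public interface ICommandHandler<in TCommand> where TCommand : ICommand
{
    Task HandleAsync(TCommand command, CancellationToken cancellationToken);
}

// Application layer: a command and its handler.
public sealed record CreateOrderCommand(string CustomerId, decimal Total) : ICommand;

public sealed class CreateOrderCommandHandler : ICommandHandler<CreateOrderCommand>
{
    public Task HandleAsync(CreateOrderCommand command, CancellationToken cancellationToken)
    {
        // Persist the order, publish events, etc.
        return Task.CompletedTask;
    }
}

// The generated files referenced types like these from the installed
// packages but omitted the matching "using <PackageNamespace>;" lines,
// so the build failed with CS0246 ("type or namespace could not be found").
```

The breaking changes I kept tripping over were exactly at this level: names and signatures where the documentation for one version doesn't match the package that actually got installed.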

I found similar issues w/ Grok on a different app I wanted to generate: documentation from non-current package versions, missing "using" statements, etc. Those issues aside, Grok seemed to produce better quality code than Co-Pilot.

So why so many "mismatch" problems w/ large requests? I'm not an expert, but I think it has to do with how all these engines are trained. I'll use an example from literature. If I wanted my AI to know about English Literature, I'd have it ingest and digest everything it can possibly find. That's a LOT of literature. I could then make it capable of creating more literature using everything it's learned. The problem is that a lot of what has been ingested is mediocre or just plain bunk. There just aren't a lot of books written by people on the level of Shakespeare, Dickens, Hardy, or Austen. So it seems unlikely that AI is going to produce anything at that level, because it's been trained on a large universe of stuff, only a small percentage of which is actually outstanding.

I think there's a similar issue w/ coding. Grok, Co-Pilot, etc., can read all the code they want in GitHub, Bitbucket, etc. But how much of it is really good? 

How much of the errors that are pointed out to these tools do they take in and use to self-correct and re-train themselves contemporaneously? Another instance I encountered: I asked Co-Pilot a question today, and it generated some markdown that included some code-snippets. But the code-snippet section, which started with the correct "```" fence, didn't have a terminating "```" fence. When I pointed that out, Co-Pilot acknowledged the error and produced corrected markdown. So then I asked, "why don't you always produce the terminating sequence for all code-snippets in markdown?" Its response was "You have to have 'memory' turned on for me to do that all the time for you". But shouldn't it be done all the time, regardless of who is asking the question? Shouldn't the AI recognize that it made an error and correct it for everyone, for every code-snippet it produces in markdown from now on? That just doesn't seem to be part of the paradigm.

Contemporaneous, persistent error correction needs to happen...