Ok I've been away for a while, but I'm still alive hehe.
I think that blanket statement may be false. I'm asking you to consider the possibility that the SYSCON can know under some circumstances, and that it might actually know what it's talking about when it tells you the general area where the problem is coming from.
Imagine this scenario:
SYSCON: "Hey CPU! Ya dead?"
CPU: "Nope, I'm good here."
SYSCON: "Hey SB! Ya dead?"
SB: "Nope, I'm good here."
SYSCON: "Hey RSX! Ya dead?"
RSX: "Nope, I'm good here."
SYSCON: "Hey CPU! Can you hear RSX okay?"
CPU: "Hold on! Hey RSX! Ya dead?"
RSX: "Nope, I can hear you loud and clear."
CPU: "Okay, I'm back. Yup, RSX checked in loud and clear."
SYSCON: "Okay, you guys are clear to start the bootloader. GG!"
The SYSCON must have a comm line with each chip individually, and the chips have lines with each other. By cross-referencing who can't communicate with whom, the SYSCON is able to narrow the fault down to a general area. I'll wager that most of the time a BGA defect is not going to land in a spot where it can't narrow it down to either the RSX or the CPU. That's the kind of information SONY's repair techs want to know, so I'm sure they engineered a solution too. Probably a super awesome jig with spring pins that runs diagnostics, with a neat GUI that spits out "Replace CPU" or "Replace RSX" or "Replace IC6102."
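To make the cross-referencing idea concrete, here is a rough Python sketch. It is purely a thought experiment: nobody here knows how the real SYSCON firmware decides anything, and the function name, the chip names as dictionary keys, and the link list are all my own assumptions. It only shows the logic of "who can't talk to whom."

```python
def narrow_fault(self_check, link_check):
    """Return the set of suspect chips (hypothetical model, not SYSCON firmware).

    self_check: dict of chip name -> True if the chip answered the SYSCON.
    link_check: dict of (chip_a, chip_b) -> True if the two chips can hear each other.
    """
    # A chip that does not even answer the SYSCON is the obvious suspect.
    dead = {chip for chip, ok in self_check.items() if not ok}
    if dead:
        return dead

    # Otherwise look at which chip-to-chip links failed.
    failed = [set(link) for link, ok in link_check.items() if not ok]
    if not failed:
        return set()  # everything checked in, clear to boot

    # A chip that appears in every failed link is the prime suspect.
    common = set.intersection(*failed)
    # If no single chip is common, every endpoint of a failed link stays suspect.
    return common if common else set.union(*failed)


all_alive = {"CPU": True, "RSX": True, "SB": True}
print(narrow_fault(all_alive, {("CPU", "RSX"): False, ("CPU", "SB"): True}))
print(narrow_fault(all_alive, {("CPU", "RSX"): False, ("CPU", "SB"): False}))
```

Note that with only the CPU-RSX link failing, the sketch can do no better than "CPU or RSX." It only singles out one chip when several failed links share it.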
That's not really the point though. The point I was making is that the SYSCON is trying to tell us where the problem is, and we are ignoring it and applying a blanket fix: "Reball the RSX anyway." Just like false positives occur with hair dryers due to thermomechanical reconnection of BGA defects, I'm sure a reflow on the RSX could look like a success when the CPU is where the BGA defect was and still is! Perhaps this is why some of these reflows/reballs fail soon after: they didn't also reflow the CPU. We're ignoring the bittraining error that's trying to tell us where the problem is. Of course, in the case of RSX errors, that's harder since we don't have the same documentation we have for the Cell.
May or may not. In any case be careful. The CPU is the brain. Telling people to reflow CPUs is a good way to have boards destroyed beyond repair. Sometimes for nothing.
If you notice, I was actually proposing a new kind of test to probe further in these scenarios. Not really a blanket, but I still stand by what I said.
I don't think the "SYSCON" alone can pinpoint the root of the problem in these cases, because 3034 bittraining is a stage already past the individual checks. It happens between the CPU and RSX (and SB). That's why we are here trying to figure out what to do: we want our own version of your hypothetical "jig" that probes lots of points on the board. (But notice that this is already going beyond the capabilities of the SYSCON.)
Let's take your scenario because it's totally relevant. CPU<~~~>RSX
CPU: Hey RSX, wake up and say hello!
RSX: ....
CPU: I can't hear anything...
SYSCON: Ok boys that's enough, power off. (CPU ERR 3034/44xx) pipipi
Now consider this scenario:
CPU: Hey RSX, wake up and say hello!
RSX: Hello!
CPU: I can't hear anything...
SYSCON: Ok boys that's enough, power off. (CPU ERR 3034/44xx) pipipi
To the SYSCON, these 2 scenarios look exactly the same: the CPU did not hear a valid response from the RSX. 3034.
In one case there was nothing to hear, but in the other case the "hello" wasn't heard by the CPU, even though the RSX did its job.
Let's say it was a FlexIO ball under the CPU that's not making proper contact. So the "hello" signal was stuck between the board and the CPU. In theory it's possible.
But it's also possible, for example, that it was a FlexIO ball under the RSX, on the same line.
This different scenario would look exactly the same, because these 2 balls were supposed to be connected directly, with nothing else in between. So there would be no possible way to know. Nothing to probe.
All we could do is guess that it's probably the RSX based on statistics. So it's probably wise to look at it first, also because keeping the CPU alive is the #1 priority.
In this hypothetical case, both chips were good inside. Both doing their job. Just a connection problem.
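Here is the same argument as a tiny Python sketch. It is a toy model under my own assumptions (one line, one ball on each end, and a SYSCON that only sees pass/fail), not a description of real hardware or firmware. The point is just that different physical faults collapse into the same report.

```python
from itertools import product


def cpu_hears_rsx(rsx_alive, rsx_ball_ok, cpu_ball_ok):
    # The "hello" only arrives if the RSX sends it AND both balls on the line conduct.
    return rsx_alive and rsx_ball_ok and cpu_ball_ok


def syscon_report(rsx_alive, rsx_ball_ok, cpu_ball_ok):
    # Assumed model: the SYSCON only learns whether the CPU got a valid answer.
    return "OK" if cpu_hears_rsx(rsx_alive, rsx_ball_ok, cpu_ball_ok) else "3034"


# List every combination of faults that ends in the same 3034 report.
for state in product([True, False], repeat=3):
    if syscon_report(*state) == "3034":
        rsx_alive, rsx_ball_ok, cpu_ball_ok = state
        print(f"rsx_alive={rsx_alive} rsx_ball_ok={rsx_ball_ok} "
              f"cpu_ball_ok={cpu_ball_ok} -> 3034")
```

Seven of the eight combinations give 3034, and from the report alone you cannot tell which one you are in.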
But we know that reality isn't always like this.
It can be that this same ball/trace/pad is actually shorted to GND... or to a neighboring data line. In this scenario, again, the CPU cannot hear a valid response from the RSX. But the balls are all fine. The data lines are shorted inside the RSX, even if the core is fine and reads a "healthy" 3 ohms as you say...
Now the funniest thing is that this kind of problem also reacts to even a little heat. And very predictably. Probably more so than a BGA reconnection due to thermal warping. So this is still a never ending story.
Just ask @botakompong.
The difference is, that maybe we could test to find this... Without heat. Before reballing.
Beyond what the SYSCON can see.
Take a multimeter and a knife, and scratch and probe those famous data lines for shorts to GND or to each other. There should be no shorts.
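If you want to log the readings instead of eyeballing them, something like this Python sketch could flag the suspects. Everything in it is an assumption on my part: the line names are made up, and the 10 ohm threshold is only a placeholder, so compare against a known-good board before trusting any number.

```python
# Placeholder threshold (assumption): anything below this counts as a short.
SHORT_THRESHOLD_OHMS = 10.0


def find_shorts(to_gnd, between, threshold=SHORT_THRESHOLD_OHMS):
    """Flag suspicious multimeter readings.

    to_gnd:  dict of line name -> resistance to GND in ohms.
    between: dict of (line_a, line_b) -> resistance between the two lines in ohms.
    """
    findings = []
    for line, ohms in sorted(to_gnd.items()):
        if ohms < threshold:
            findings.append(f"{line} shorted to GND ({ohms} ohm)")
    for (a, b), ohms in sorted(between.items()):
        if ohms < threshold:
            findings.append(f"{a} shorted to {b} ({ohms} ohm)")
    return findings


# Hypothetical readings; "D0" and "D1" are invented labels for two data lines.
readings_gnd = {"D0": 0.4, "D1": 1200.0}
readings_pairs = {("D0", "D1"): 5000.0}
for finding in find_shorts(readings_gnd, readings_pairs):
    print(finding)
```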
And the hypothesis is that maybe, as you say, the SYSCON can in fact help us narrow down the points with the BitTraining status and the additional 44xx RSX error... but only along the vertical axis, not horizontally. Maybe we could map 4421, for example, to a specific point or line instead of the whole vertical.