PS3 Fault finding YLOD with the SYSCON - First steps and Error reporting

A Retrospective Analysis of SYSCON Errorlogs
(Reported here & on the NEC/TOKIN thread up to now)
Cause of YLOD.jpg

I decided to lead with the DATA you most wanted to see. As of January 2022, 114 users have reported over 250 consoles' worth of error codes to this thread and the NEC/TOKIN thread. I painstakingly collated all of the reported error codes and noted each console's history/progression on the forum: what was wrong with it and how it was resolved or, if it wasn't, my best guess at a diagnosis. A lot of the time the issue is unknown, so I created a category for that.

When I say "painstakingly," I mean I literally burnt my eyes up on it. I have an optometrist appointment Monday to get them checked out. I suspect eye strain, perhaps an uncorrected vision problem. Whatever the case, it's a lot more unnerving than I imagined. Double vision will make you panic! I don't recommend it, take breaks!!! Why I did this to myself can only be described as a masochistic obsession. Nay, an Ahab level vendetta against the YLOD and the misinformation making things worse. I have a problem. I need to just let it go, but I don't want to.

In the meantime, please enjoy the fruits of my tribulation...
Untitled.jpg
Relative Abundance of Error Codes.jpg


The following SYSCON Error Code Matrix is the same DATA as above, but viewed in a way that makes it easier to see which errors group at earlier step numbers. This is important because of Power On Sequencing. Knowing what the console is doing at each step number (top row) allows you to deduce what the error code (leftmost column) means. The number highlighted in Red is the number of consoles reported to have exhibited that error (out of the 250 total consoles we have data for)...

Step No Vs Code.jpg

I will elaborate on the meaning of these errors and the step number at which they occur in "Power ON Topology Part 3" at a later date. But here is an example of how important the step number is in diagnosing.

Errors such as 1001 (CELL VRM Power Issue) and 1002 (RSX VRM Power Issue) can occur at many step numbers. If one occurs when the console is idle (Step #80), you know that nothing was wrong during Power On Sequence Testing (POST). For example, when NEC/TOKIN Proadlizers are beginning to fail, ripple/noise under load triggers these errors (A0801001 / A0801002). As the NEC/TOKINs get worse they can interfere with the Power On Sequencing (POS). Important initialization steps and checks are performed during the POS. If excessive voltage ripple/noise causes an interruption during any one of these steps, you can get a 1001/1002 code with a step number earlier than 80.
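To make the step number / error code split concrete, here is a minimal Python sketch. It assumes the layout seen in every full code quoted in this post (a 2-digit prefix such as A0, then the 2-digit step number, then the 4-digit error code). The function name decode_syscon_error is made up for illustration and isn't part of any SYSCON tool.

```python
def decode_syscon_error(entry: str) -> dict:
    """Split an 8-hex-digit errorlog entry into prefix, step number and error code.

    Assumed layout (inferred from codes like A0801001): PP SS CCCC.
    """
    text = entry.strip().upper()
    if text.startswith("0X"):
        text = text[2:]  # bringup logs print the code as 0xa0801701
    if len(text) != 8 or any(c not in "0123456789ABCDEF" for c in text):
        raise ValueError(f"expected 8 hex digits, got {entry!r}")
    return {
        "prefix": text[0:2],  # "A0" on the full codes quoted in this post
        "step": text[2:4],    # step number, e.g. "80"
        "code": text[4:8],    # error code, e.g. "1001"
    }


print(decode_syscon_error("A0801001"))
print(decode_syscon_error("0xa0611802"))
```

With that split, A0801001 and A0201001 are the same error (1001) caught at two different step numbers, which is the distinction the rest of this post leans on.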

You can better visualize this phenomenon using the second graph (Error Code Matrix). This matrix reveals other errors which exhibit the same behavior. 1004, 1200, 1301, 1802, 2024, 2030, 2031, 2033, 2101, 2120, 2124, and 2131 show the same ability to occur at multiple step numbers. This is significant and provides insight into what's causing the error. It's kind of a breakthrough for us!

For example, Step number A0 = Immediately after SYSCON Reset. Errors seen occurring at that step number are 2030, 2031, 2033, 2124, & 2131. When the power rocker is flipped on, IC6004 receives +5V_EVER directly from the PSU. It produces /SYSCON_RST automatically. An error immediately after SYSCON reset will beep the moment you flip that rocker or plug in the console!
  • IC6005/6 are powered by +5V_EVER and produce +3.3V_EVER and 1.8V_EVER respectively. These are the voltages that power the SYSCON chip. If either of those ICs goes bad, the SYSCON cannot do anything. You won't even get a standby LED. It'll look like the PSU is dead, when it isn't.
  • SYSCON Reset serves as the enable for IC6009, which forms +3.3V_THERMAL for the CPU, RSX, and SB Thermal Monitors. If they are bad at this step you get errors 2030, 2031, & 2033 respectively at step number A0.
  • If IC6009 is bad, it will probably cause 2030 because the CELL Thermal monitor is the first in line to be checked.
"Get to the point!"
If these errors occur later (step #80) you don't need to worry about any of this! When errors occur at step numbers earlier than 80, you have to factor in what the console is doing at that step# to narrow down the possibilities.
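As a rough illustration of that reasoning for the thermal monitor codes, the sketch below turns it into a lookup. The code-to-monitor mapping comes straight from the bullets above; the names THERMAL_MONITOR and reset_step_hint are hypothetical, and the hint strings are only a paraphrase of this post, not a diagnostic procedure.

```python
# 2030/2031/2033 = CPU (CELL), RSX and SB thermal monitors, per the bullets above.
THERMAL_MONITOR = {"2030": "CELL", "2031": "RSX", "2033": "SB"}


def reset_step_hint(entry: str) -> str:
    """Return a short hint for a thermal monitor error based on its step number."""
    text = entry.strip().upper()
    step, code = text[2:4], text[4:8]  # assumed layout: prefix, step, code
    monitor = THERMAL_MONITOR.get(code)
    if monitor is None:
        return "not a 2030/2031/2033 thermal monitor error"
    if step == "A0":
        # Immediately after SYSCON reset: the monitor or its supply is suspect.
        return f"{monitor} thermal monitor at step A0: check the monitor, IC6009 and +3.3V_THERMAL"
    if step == "80":
        # Console was already on, so the reset-time checks passed.
        return f"{monitor} thermal monitor at step 80: reset circuit is not the concern"
    return f"{monitor} thermal monitor at step {step}: consider what the console does at that step"


print(reset_step_hint("A0A02031"))
```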

The most poignant example is with 2120/3013 error combinations. The step number is everything. See the error combos section below.
I have a lot more, but I'll save it for "Power ON Topology Part 3."
Relative Abundance of SYSCON Error Codes.jpg

Common Error combinations:

3034/4xxx
3034/4401
8x Consoles total:
  • 2x consoles had the same BitTraining error (BE:RRAC:RX0:GLOBAL1:RX_STATUS). Others didn't post bringup.
  • 1x had 1001's leading up to this 3034/4401.
  • 1x had delid damage to CPU traces.
  • 1x was probed with an oscilloscope and confirmed to have bad CPU tokins, but the BGA defect was the immediate issue. He reballed the RSX, and it didn't fix the console. The data error changed to 4411 though.

3034/4402

11x Consoles total:
  • 2x RSX:RRAC:RX0:GLOBAL1:RX_STATUS
    • 1x Reflowed RSX and was last reported working.
  • 1x "RSX Nec/tokins are 3.0 Ohms, while CELL are 14.0 Ohms." That resistance on Cell is very high. Could be an open fault. Attempted a reflow that did not change the 40 3034 but the associated data error changed to 40 4401 with BitTraining BE:RRAC:RX0:GLOBAL1:RX_STATUS. It's unlikely he actually reflowed the solder.
  • 1x was reballed (both RSX & CELL) to GLOD. No errors. RSX was moved to a known working board and did the same. The issue is suspected to be dead RSX VRAM, but that is not related to the original 3034/4402. The CPU/GPU reball fixed that; the RSX was just dead.
  • 1x @vyktomvmpay25 Reballed both CPU/RSX to a GLOD. Concluded it was bad RSX RAM. Replaced RSX to fix. Posted the resistance measurements, which showed that VDDQ (VRAM) was okay. PLL_VDD should be reading in the Mega Ohms for a 40nm RSX (off the board), so that's probably bad. VDDC is a bit low. VDDR shouldn't be 640K, but that could be a typo. 0.640k would be about right.
    • RSX_PLL_VDD = 10.41kΩ
    • RSX_VDDC = 1.6 – 1.8Ω
    • FBVDDQ = 200Ω
    • VDDIO = 938kΩ
    • VDDR = 640kΩ
    • YC_RC_VDDA = 2,940kΩ
    • YC_RC_VDDIO = 9.28 kΩ
  • @vyktomvmpay25 Reballed both CPU/RSX to a GLOD. Again concluded it was RSX VRAM, but this time didn't post resistance measurements. He's calling this situation a "Special GLOD."
  • @vyktomvmpay25 Reballed RSX and replaced tokins (it had 1001/1002 also). It became GLOD again. Didn't go further.
  • BitTraining RSX:RRAC:RX0:GLOBAL1:RX_STATUS
  • 1x Had melted plastic around vents due to hair dryer trick (probably).

3034/4411

Not much known (Potentially Dead RSX)

3x Consoles total:
  • 1x GLOD Noted, no resolution.

3034/4412

9x Consoles total:
  • 1x RSX:RRAC:RX1:GLOBAL1:RX_STATUS
  • 1x RSX:RRAC:RX0:GLOBAL1:RX_STATUS
  • 1x Reflowed RSX, worked.
  • 1x Pressure test didn't work (inconclusive)
  • 1x had a long string of A0802203's followed by a long string of A0801802's leading up to the 3034/4412.
  • 1x had a string of A0203010 leading up to the 3034/4412.

3034/4421

9x Consoles total:
  • 1x Successfully reflowed/reballed
  • 1x showed A0801200s leading up to 801701 / 801601 & 8014FF, then 3034/4421 thereafter.
  • 1x RSX:RRAC:RX0:GLOBAL1:RX_STATUS. Reflowed, reported success.
  • 1x occurred 1yr after rsx delid (perhaps not gluing IHS back on led to BGA defect. Lasted 1 year?)

3034/4422

5x Consoles total:
  • 1x Sealed. Previous errors were A0801601, A0801701, A0801802, A0801004, A08014FF. Reballed to A0801701. Replaced CXM4024R, no change. Reballing CELL yielded A0801001, A08014FF. Diagnosed RSX – bump / die failure.
  • 1x Heat test confirmed RSX BGA defect.

3034/4432

7x Consoles total:
  • 3x BitTraining RSX:RRAC:RX3:GLOBAL1:RX_STATUS
  • 1x Reflowed to a working state. Previously attempted replacing tokins, which didn't help. YLOD returned 2 months later, despite aggressive fan curves, with 801701, 801601, 801802 because PWR was on at the time the BGA/Bump failed, then 403034/404402 thereafter.
  • 1x Pressure test didn't work (inconclusive). Reflow worked.

2120/3013
Here is a shortlist of the issues preceding A0202120/A0213013 errors
  • Failed Reflow
  • Replacing Tokins on a console previously experiencing A0403034
  • Reballing RSX (only)
  • Bad F6302 and C6320
  • Visible trace Damage on CPU caused A0213013 by itself
  • 0.0V at JL6354-JL6361 (DC/DC converters for AV backend)
NOTES:
  • On page 13 of the SYSCON thread @chiefhunnablunts accidentally overwrote the eeprom at address 3961 from FF to 00. It resulted in A0202120/A0213013 error. When he changed it back, the error disappeared. Why would changing the bit there cause errors related to YC_RC_VDDIO? If it's related to SYSCON/CELL communication over SPI, then that explains it. It's also possible there was a precarious BGA defect that responded to mounting pressure between tests, IDK.
  • On page 179 of the Tokin thread @TwelveAtNight fixed A0202120/A0203013 errors by replacing F6302 and C6320.
  • On page 25 of the SYSCON thread @patricksouza472 fixed A0202120/A0213013 by replacing F6302 and C6320. Note, this console produced 10x 2120 to every 1x 3013. @Aran3a noted the same thing on page 45, but we never thought to try these SMDs. @moptop219 on page 218 of the tokins thread noted the same. There's a good chance this is what's wrong with these consoles.
  • On page 25 of the SYSCON thread @db260179 fixed A0002120 by replacing TH2501.
  • On page 21 of the SYSCON thread @nyislander had an A0213013. He checked voltages and found 0.0V at the JL6354-JL6361 points in the schematic, and confirmed +3.3V_MISC & +5V_MISC are present. Lack of voltage on the DC/DC converters downstream of IC6301 suggests there could have been blown fuses (F6301/2).

Hypothesis

3013 = CPU side of YC_RC_VDDIO
Evidence (strong): Correlative with few confounding variables
  1. @Kleon1876 on page 36 of the SYSCON thread had A0213013 after CPU trace damage from a failed delid attempt.
  2. @poot36 on page 106 of the SYSCON thread had A0202120/A0213013 when his CPU interposer was cracked in half by a failed delid attempt.
  3. On page 21 of the SYSCON thread @nyislander had a A0213013. Confirmed lack of voltage on the DC/DC converters downstream of IC6301, which includes +1.2V_YC_RC_VDDIO.
2120 = RSX Side of YC_RC_VDDIO
Evidence (weak): Anecdotal and Associative, with many confounding variables.
  1. @Bbowes9 on page 82 of the SYSCON thread had an A0313032 caused by knocking R5167 during a failed delid attempt. This console was working before the delid attempt. R5167 is the +1.2V_YC_RC_VDDIO reference voltage for the CPU's Redwood FlexIO ADC differential reference clock pair (BE_RC_REFCLK_P). The voltage was not knocked out altogether; it was selectively knocked out on a specific reference clock after IC5004. He replaced the resistor and got A0402101 / A0403034 because RSX TX1 was shorted to ground by a nicked trace during the RSX delid (incredible luck). TX is the transmit line, so the CPU will note the error because it sees the short (BitTraining BE:RRAC:BX0:BX:FLEXIO_ID). He messed with the nick and the error changed to A0313031. This shows that issues with the BE side of the clock generator's +1.2V_YC_RC_VDDIO reference voltage do not cause A0202120 and register at step number 31 (when the SYSCON checks clocks). So step numbers 20/21 pass their checks (Initialize CPU, RSX, & AV backend). The voltage was good up to at least the DC/DC converters.
  2. @feng_ye on page 103 of the SYSCON thread had a GLOD A0802120 in which the HDMI transmitter was not being set up correctly. This was after probing and fixing numerous blown fuses, so we can be sure they were good at this point in the repair. He pressed on the corner of the RSX above VDDIO (he may have been referring to +1.5V_RSX_VDDIO instead of +1.2V_YC_RC_VDDIO. I'm still not sure how or if the two are related). The HDMI transmitter reset correctly, the 2120 disappeared, and the console booted. This confirmed a BGA defect affecting those balls can cause 2120 errors and a "Special" GLOD.
  3. @MicrowaveEgg on page 90 of the SYSCON thread posted an errorlog showing a history of A0801001 leading up to an A0403034. Then it started giving A0231002/A0902120. GLOD after a BGA defect. Strange, but it's been known to happen, depending on how the balls reconnect thermomechanically from whatever repair attempt was made. He said, "PS3 was never opened." So there wasn't a reflow/reball. He later said he "recapped" and got a bunch of A0202120's, followed by one A0213013 and a bunch of A0231002's. My guess is there is both a BGA defect and a potential fuse issue. The timeline of events is suspect too; IDK how trustworthy his recollection of the errorlog and/or method of dumping it is. I marked this one in the BGA/Bumps category, but it's not as straightforward as other consoles.
  4. F6302 and downstream SMDs are involved in the formation of 1.2V_YC_RC_VDDIO. Also, BGA/Bump defects can affect both the GPU/CPU FlexIO pads for YC_RC_VDDIO. We see lots of 2120 errors related to BGA/Bump defects that are resolved by RSX reballs.
  5. @squeept posted at or around page 20 of the SYSCON thread about a console with A0A02031 A0202120/A0213013. It was not sealed. Flux everywhere, severe warp, likely heatgun. Shorts present near encoder chip and a bad choke. Did not attempt to fix.
Conclusion:
2120/3013 errors can come from BGA/bump defects or from fuses/SMDs. Distinguishing them from one another comes down to the step number at which they occur.
  • 80/90 = BGA/Bump defects and possibly voltage ripple/noise.
  • 20/21 = Fuses/SMDs (possibly a BGA, but less likely).
  • 00 = fuse.
The console needs troubleshooting. RSX and CPU Ohm tests and voltage test points need to be measured to build a picture of what's going on. Pay special attention to JL6354-JL6361, especially +1.2V_YC_RC_VDDIO. Check that F6302 isn't blown. Check nearby caps for shorts.
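The step-number rule of thumb from this conclusion can be written down as a small lookup. This is only a sketch of the hypothesis above, not a confirmed diagnostic; the function name triage_2120_3013 is hypothetical and the code assumes the prefix/step/code layout used throughout this post.

```python
def triage_2120_3013(entry: str):
    """Map a 2120 or 3013 errorlog entry to the suspect group suggested above.

    Returns None for any other error code.
    """
    text = entry.strip().upper()
    step, code = text[2:4], text[4:8]  # assumed layout: prefix, step, code
    if code not in ("2120", "3013"):
        return None
    if step in ("80", "90"):
        return "BGA/bump defect, possibly voltage ripple/noise"
    if step in ("20", "21"):
        return "fuses/SMDs (possibly a BGA defect, but less likely)"
    if step == "00":
        return "fuse"
    return "step number not covered by the conclusion above"


for logged in ("A0202120", "A0213013", "A0902120", "A0002120"):
    print(logged, "->", triage_2120_3013(logged))
```

Whatever the lookup says, the measurements listed above (Ohm tests, JL6354-JL6361, F6302, nearby caps) are still what settles it.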

3010

6x consoles
  • @Barteg, pg207, Tokin thread. A0203010 errors lead up to 3034/4412 (RSX RRAC BitTraining error = +1.2V_RSX_VDDR).
  • @marciolsf, pg1 of "fun with syscon" thread. A0203010 while live probing pins 19-21 of IC6103. That's the CPU Buck Controller Gate pins, which send PWM to coordinate 3 Buck converters (3-phase). Bridging these pins would cause the timing to fail and the Voltage feedback error compensation would freak out causing no PWR Good. Not sure if the SYSCON can tell if this was due to the Buck converters or not.
  • @hrist, pg77, Syscon thread. A0313031 & A0902120 after performing the "Eraser mod" (placing pressure on Underside hole of CPU to increase contact pressure with HS). Step# 31 = CPU initialization. Reflowed to errors A0202120/A0213010. Suspect balls didn't actually flow.
  • @Bosstom, Pg53, SYSCON thread. Had A0801002 errors fixed by replacing Side B tokins (both CPU/RSX). Console failed again 6 months later. He replaced Side A this time, which led to A0203010.

1802

This post is a particularly useful one to read.

7x consoles
  • @chiefhunnablunts on page 14 of SYSCON thread had A0201802 / A0A02031. There was an RSX thermal monitor error immediately after SYSCON Reset (Step# A0) followed by an RSX Initialization error when the RSX is first initialized in POST. The RSX isn't responding at all...period...nada! And that thermal error is bad news too. He concluded it was a dead RSX and moved on. I would have liked to see him try replacing the thermal monitor and probe around the SYSCON reset circuit.
  • @squeept, pg23. Previously reworked. "EVERY chip heatgunned, warped." Current error was A0233020. Previous errors were A0801301, A0902120. Absolutely covered in flux. Can't even see if there is any delamination from heatgun. No attempt made. Diagnosis = Heat gunned to death! Later he removed the RSX to see what would happen: "I could smell some magic smoke pretty much immediately, but I didn't see any fireworks. Then I still got a 2 second YLOD. I got errors A0A02031 and A0201802 at once. I ohm tested across the TOKIN after and got a dead short." @Byteman said, "Turning on PS3 without RSX will definitely blow your TOKINs. They will be shorted." I agree with this. It's similar to what happens when you don't use bridge wires: the VRM tries to supply max voltage (3.3v) and it overloads the tokins' 2.5v rating (they act as a fuse in case of an RSX short or open line scenario).
  • @moptop219, pg218 (tokin thread). Long string of A0802203's followed by a long string of A0801802's. The latest error was A0403034/A0404412. So a BGA/Bump defect is currently hiding another issue. Or the defect was affecting whichever pads cause 2203 and 1802 errors. Then it progressed to causing the traditional 3034/4xxx. IDK.
  • @squeept, pg 20. Sealed. Had A0403034 / A0404422, but previous errors in log showed A0801601, A0801701, A0801802, A0801004, A08014FF. Reballed to A0801701. Replaced CXM4024R no change. Reballed CELL yielded A0801001, A08014FF. Diagnosed RSX – bump / die failure.
  • @CodeKiller, pg2. Had A0403034 / A0404432. Reflowed to a working state. Previously attempted replacing tokins, which didn't help. YLOD returned 2 months later, despite aggressive fan curves, with 801701, 801601, 801802 because PWR was on at the time the BGA/Bump failed, then 403034/404402 thereafter[pg13].
  • @vyktormvmpay25, pg75. A0611802 "slim dropped on floor unit while it was working. rsx was missing pads. clean errors, same 1802 error only after adding new rsx." 1802, 14ff and 1701. He swapped the RSX to another motherboard for the customer. This one suggests that 1802 errors can be tied to the MB not just the RSX. But victor's wording/translation is a bit confusing. So maybe I'm misunderstanding him.
  • @Kibillcat, pg75. CELL BE die was chipped. GLOD, Be PLL unlock. "I would constantly receive A0611802 and A0801301 errors. Error 1301 would happen more often. But on one point I received HDMI error." He managed to record the bringup of this 1802 error...
    Code:
    bringup
    [SSM] state: 0000 -> 0101
    Bringup Mode #0 (0xFF)
    [SSM] ssmCb_OnStartingBePowOn() called.
    [SSM] Bringup mode : syspm_stat=00000000/00000000
    [POWSEQ] PowerSeq_Setup called.
    [SSM] state: 0101 -> 0201
    [POWSEQ] AV Backend Setup
    [SSM] state: 0201 -> 0102
    [SSM] state: 0102 -> 0202
    [SSM] state: 0202 -> 0103
    [SSM] state: 0103 -> 0203
    [SSM] ssmCb_BeforeBeOn() called.
    [SSM] state: 0203 -> 0104
    Psbd_SbTransMode_Half:0x20e2
    >$
    [SSM] state: 0104 -> 0204
    [SSM] state: 0204 -> 0105
    [SSM] nonfatalreq delayed.
    [SSM] state: 0105 -> 0400
    (PowerOn State)
    [SSM] RSX Interrupt : Detected !
    RSX SY_IES register (0x0008) = 0x4000000
    [SSM] state: 0400 -> 0700
    [POWSEQ] AV Backend Letup
    [SSM] ssmCb_AfterBeOn() called.
    [SSM] Shutdown mode : syspm_stat=00000000/00000000
    [ERROR]: 0xa0611802
    [POWSEQ] PowerSeq_Letup called.
    [SSM] state: 0700 -> 0600
    (PowerOff State) (Fatal)
Conclusion:
Step# 20 is when the RSX is first Initialized. So if it's not responding then it's borked! 1802 is the error the SYSCON will return when there is no RSX installed at all! @squeept showed that by testing a console without one.

However, it's not that simple. The specifics I don't understand. The DC/DC regulators that supply power to the RSX are checked during step numbers 00-10. So there isn't anything wrong with them. Somewhere after the DC/DC converters and whatever voltage reference/signal the SYSCON monitors is where the 20 1802 can happen.

It's still unclear if the issue is on the RSX itself, or the motherboard. Various users have assumed the issue was a dead RSX, but @vyktormvmpay25's results suggest the MB may also be to blame. No one specifically said they repaired an 1802 by replacing the RSX.

What we need are Ohm tests of all the RSX voltages and the voltage readings upon bringup. This would narrow the fault down to the voltage line involved. Then that line needs to be checked thoroughly for shorts, blown fuses, and ripple voltage/noise. If all is good, the RSX should be replaced with a known good one (ohm tested on all 7 voltages). And the old one should be cleaned and ohm tested to figure out which voltage was bad. That would confirm if this error is a dead RSX and which voltage causes it.
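If you capture a bringup like the one @Kibillcat recorded above, a few lines of Python can pull out the [SSM] state transitions and show which state the SYSCON was in when the error was logged. This is a rough sketch: the regular expressions are based only on the log lines quoted in this post, and summarize_bringup is a made-up name, not part of any existing tool.

```python
import re

STATE_RE = re.compile(r"\[SSM\] state: ([0-9A-Fa-f]{4}) -> ([0-9A-Fa-f]{4})")
ERROR_RE = re.compile(r"\[ERROR\]: 0x([0-9A-Fa-f]{8})")


def summarize_bringup(log: str):
    """Return (transitions, errors) found in a bringup log.

    transitions: list of (from_state, to_state) strings
    errors: list of (error_code, state_when_logged) tuples
    """
    state = None
    transitions, errors = [], []
    for line in log.splitlines():
        m = STATE_RE.search(line)
        if m:
            transitions.append((m.group(1), m.group(2)))
            state = m.group(2)  # remember the most recent state
            continue
        m = ERROR_RE.search(line)
        if m:
            errors.append((m.group(1).upper(), state))
    return transitions, errors


sample = """[SSM] state: 0105 -> 0400
(PowerOn State)
[SSM] RSX Interrupt : Detected !
[SSM] state: 0400 -> 0700
[ERROR]: 0xa0611802
[SSM] state: 0700 -> 0600"""

print(summarize_bringup(sample))
```

On the excerpt above this shows the error being logged after the 0400 -> 0700 transition, i.e. after the "RSX Interrupt : Detected !" line and before the console drops to the PowerOff state.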

1701 / 1601 or 14FF
1701
ATTENTION is an active-high output flag sent by the CPU to the SYSCON. During initialization & configuration it is used to request an operation by the SYSCON. When ATTN goes high, the SYSCON reads the SPI Status Register to determine the cause of the Attention signal. It remains high until software resets the condition that caused it.

Attention is used during Power On Reset (POR)...
  • To load CPU VID voltage from the VRM internal registers.
  • To Write configuration-ring data (Important CPU Config settings that should only be modified at boot, otherwise errors can occur).
  • To calibrate the FlexIO interface (BitTraining).
If Attention occurs during the Power ON State (Step# 80) it indicates an error condition. Basically, something is flagged by the Processor as abnormal. It's forced to attempt to resolve the problem before it can continue with whatever it was trying to do. When the CPU encounters an error condition during normal operation it sends the ATTENTION signal to the SYSCON. The SYSCON immediately shuts off the console, then reads the SPI Status Register to determine the cause. Then it records an error A0801701 in its errorlog. Errors that can cause the Attention include
  • Unresolved Checkstop errors (14FF)
  • Livelock Detection (1601)
  • PLL Unlock Condition (1301)
  • BGA/Bump Defect that occurs while the Console was On (Step# 80). Subsequent attempts to power on the console would result in 3034/4xxx errors.
For example...
  • @poopskoop on pg27 of this thread had a console with a BGA/Bump defect (A0403034). He heatgunned it to a 3-5 second YLOD. It would error with the following log…
    Code:
    Attention BE : Detected !
    [SSM] BE Attention signal is detected !!
    [SSM] state: 0400 -> 0700
    [POWSEQ] AV Backend Letup
    [SSM] ssmCb_AfterBeOn() called.
    [SSM] Shutdown mode : syspm_stat=00000000/00000000
    [ERROR]: 0xa0801701.
  • Note, IBM's Hardware initialization guide says that "When the Cell BE processor drives the ATTENTION signal active, the reason for the attention is stored in the rd_spi_status register that is accessible from the SPI interface[...]If the rd_spi_status register reads as all zeros or all ones, then the data is not being correctly." I think that's what syspm_stat=00000000/00000000 is referring to.
  • He posted a pic of his RSX Ohm tests. VDDQ might have been a little low. I'm not sure if VRAM was on its way out or not, or if that would cause this error. Bad RSX VRAM is known to cause a GLOD. And he did describe the console as a GLOD, even though he said it would shut off after 3-5 seconds. Regardless, that's long enough for the POR sequence to have completed. Attention should be driven low after BitTraining and stay there!
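The "all zeros or all ones" check from the IBM quote is easy to apply to a logged value. The sketch below runs it on the syspm_stat field from the bringup line above. Keep in mind that treating syspm_stat as the rd_spi_status readout is only my guess from earlier in this section, and the names STAT_RE, reads_all_zeros_or_ones and check_syspm_stat are invented for this example.

```python
import re

STAT_RE = re.compile(r"syspm_stat=([0-9A-Fa-f]+)/([0-9A-Fa-f]+)")


def reads_all_zeros_or_ones(hex_text: str) -> bool:
    """True if every bit of the hex string is 0, or every bit is 1."""
    digits = set(hex_text.strip().upper())
    return digits == {"0"} or digits == {"F"}


def check_syspm_stat(line: str):
    """Return one flag per half of syspm_stat, or None if the field is absent."""
    m = STAT_RE.search(line)
    if m is None:
        return None
    return [reads_all_zeros_or_ones(half) for half in m.groups()]


print(check_syspm_stat("[SSM] Shutdown mode : syspm_stat=00000000/00000000"))
```

If both halves come back flagged, the value carries no usable cause information, which fits the log above where only the 1701 itself gets recorded.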
Configuration-Ring Load Check
The configuration ring is a series of bits that must be loaded into the CPU over SPI during the POR sequence. These bits configure the Cell BE processor before starting. They should only be set during POR. If they are changed after starting, it can result in faulty operation. The hardware must reset before the CPU's config-ring can be used again. After the POR the attention signal is driven low and is supposed to stay there! If there is a Checkstop error (14FF), Livelock Detection (1601), or PLL Unlock (1301) the CPU enters a fault condition and raises the Attention signal (1701). One common way this happens is when a solder connection breaks while the system is on.

The following is taken from IBM's Hardware Initialization Guide, CMOS SOI 65 nm Cell Broadband Engine.

The guide goes DEEP into the details of how the CELL BE Processor operates and is a wealth of knowledge, but it's quite advanced. If you can wrap your head around it though, it does offer a debugging guide in case of a 1701 error.
IBM said:
To summarize, check the following conditions:
  • That the rd_spi_status register matches the expected value.
  • That VDD and VCS are adjusted to the correct voltage indicated by the VID according to the Cell Broadband Engine Datasheet.
  • That the configuration ring is loaded with the partial good information from the rd_partial_good register.
  • That the following configuration-ring information is correct
    • The SPI address is correct.
    • The SPI simple write sequence is correct.
    • A '1' start bit prefixes the data.
    • The configuration data length matches the specified length for this Cell BE processor version

1601
  • CELL CPU is deadlocked and cannot proceed. Some "kind" of error occurred, preventing the process from completing. Basically this means the console froze and rebooted. It's the PS3 equivalent of the Blue Screen of Death (BSOD) you may already be familiar with. In the PS3 this is often preceded by graphical artifacting.
    O1mI87K.jpg

14FF
IBM said:
A checkstop occurs when - usually a processor but sometimes a cache, memory, or I/O bus controller - determines that something is in an "impossible" state. An error occurs that cannot be isolated to a particular bus transfer in progress, or a processor detects no progress being made...Checkstops are inherently hardware phenomena. They do not necessarily indicate a solid failure of a component, so diagnostics will rarely determine that a problem exists.
(Source)
It's similar to the LiveLock situation in 1601 errors but not the same. It's not surprising it can also occur at the time of a BGA defect.

We've seen this error a lot. The common story between them was that the console was on at the time the YLOD occurred. All subsequent attempts to start the console resulted in a GLOD with further 1601/1701 errors, or a YLOD within 2 seconds. SYSCON errors usually show one A0801601/A0801701 pair occurring at the same timestamp, followed thereafter by 3034/4xxx errors for all subsequent attempts to PWR it on. Or it'll GLOD and throw more 1701/1601's. I think this means there is a precarious BGA defect teetering on the edge of breaking and it'll soon switch to 3034/4xxx like the others. But that's a guess.
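Here is a small sketch of how that pattern could be spotted in a dumped errorlog. It assumes you have already turned the log into (timestamp, code) pairs ordered oldest first; that input format and the function name failed_while_on are hypothetical, since dump tools differ in how they present timestamps.

```python
def failed_while_on(entries) -> bool:
    """Look for a 1701 sharing a timestamp with 1601 or 14FF, then a later 3034.

    entries: list of (timestamp, code) tuples, oldest first (assumed format).
    """
    paired_at = None
    for i, (ts, entry) in enumerate(entries):
        if entry.upper()[-4:] != "1701":
            continue
        # Error codes logged at the same timestamp as this 1701.
        same_time = {e.upper()[-4:] for t, e in entries if t == ts}
        if same_time & {"1601", "14FF"}:
            paired_at = i
            break
    if paired_at is None:
        return False
    # Any 3034 logged after the paired 1701 completes the pattern.
    return any(e.upper()[-4:] == "3034" for _, e in entries[paired_at + 1:])


log = [
    (1, "A0801001"),
    (2, "A0801601"),
    (2, "A0801701"),
    (3, "A0403034"),
    (3, "A0404412"),
]
print(failed_while_on(log))
```

A match only says the log fits the story described above. It doesn't prove a BGA defect by itself.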

Complicating the issue is the fact that sometimes people will get a 1301, 1401, 14FF, or 1802 also. I'm not sure what to make of it. Perhaps it just has to do with where the BGA is failing and it's involving those sub systems briefly before it too fully breaks...IDK.

Here's a few examples:
  • @leral, pg79. Attempted a reflow, but it froze in XMB. Afterwards it either GLOD or YLOD with 1701/1601.
  • @Shawn Shakir, pg218 (Tokin thread). Oldest errors are 2x A0802022's and one A0801001. There was a 1701/1601 followed by a bunch of 3034's.
  • @squeept, pg 21. RSX reflowed previously. Beat up, filthy. Previous errors were A0801701, A0801601, A0801001. Noticed delamination on GPU when cleaning after reball. Not sure if present before due to mess. A0403034 remained after reball. Diagnosis = Heat gunned to death!
  • @squeept, pg23. Previously reworked. "Whole board heatgunned." Previous errors were A0801601, A0801701, A08014FF. "Discovered extensive delamination after ultrasonic bath cleaned the globs of flux off, stopped work." About the 14FF...
  • @andrewscott87, pg103. GLOD, video reset worked. Then it started artifacting and froze. Then back to GLOD. He can repeat that cycle. Sometimes there was an A0801001 in the log, probably from flipping PWR off. 1701/14FF always come in a pair (occur at the same timestamp). He did not have any 1601 errors. Instead he has 14FF = CheckStop error.

I will edit this post to make improvements to the formatting and grammar. I will likely add new information about the error combos as I learn more. There's just so much to unpack for me to get it all in on the first go. So this is just the start.
 
Last edited:
A Retrospective Analysis of SYSCON Errorlogs
(Reported here & on the NEC/TOKIN thread up to now)
View attachment 35872

I decided to lead with the DATA you most wated to see. As of January 2022 114 users have reported over 250 Consoles worth of error codes to this thread and the NEC/TOKIN thread. I painstakenly collated all of the reported error codes and noted the consoles history/progression on the forum. What was wrong with it and how it was resolved, or if it wasn't made my best guess to diagnose it. A lot of the time the issue is unknown. So I created a category for that.

When I say "painstakenly," I mean I litterally burnt my eyes up on it. I have an optomitrist appointment moday to get them checked out. I suspect eye strain, perhaps an uncorrected vision problem. Whatever the case, it's a lot more unnerving than I imagined. Double vision will make you panic! I don't recommend it, take breaks!!! Why I did this to myself can only be described as a masochistic obsecession. Nay, an Ahab level vendetta against the YLOD and misinformation making things worse. I have a problem. I need to just let it go, but I don't want to.

In the meantime, please enjoy the fruits of my tribulation...
View attachment 35873

The following SYSCON Error Code Matrix is the same DATA as above, but viewed in a way that makes seeing which errors group at earlier step numbers. This is important because of Power On Sequencing. Knowing what the console is doing at each step number (top row) allows you to deduce what the error code (leftmost column) means. The number highlighted in Red is the number of consoles reported to have exhibited that error (out of the 250 total consoles we have data for)...

View attachment 35875
I will elaborate on the meaning of these errors and the step numbr at which the occur in "Power ON Topology Part 3" at a later date. But here is an example of how important the step number is in diagnosing.

Errors such as 1001 (CELL VRM Power Issue) and 1002 (RSX VRM Power Issue) can occur at many step numbers. If it occurs when the console is idle (Step #80), you know that nothing was wrong during Power On Sequence Testing (POST). For example, when NEC/TOKIN Proadlizers are begining to fail, ripple/noise under load triggers these errors (A0801001 / A0801002). As the NEC/TOKINs get worse they can interfere with the Power On Sequencing (POS). Important initialization steps and checks are performed during the POS. If excessive voltage ripple/noise causes an interruption during any one of these steps, you can get a 1001/1002 code with a step number earlier than 80.

You can better visualize this phenomenon using the second graph (Error Code Matrix). This matrix reveals other errors which exhibit the same behavior. 1004, 1200, 1301, 1802, 2024, 2030, 2031, 2033, 2101, 2120, 2124, and 2131 show the same ability to occur at multiple step numbers. This is significant and provides insight to what's causing the error. It's kind of a breakthrough for us!

For example, Step number A0 = Immediatly after SYSCON Reset. Errors seen occuring at that step number are 2030, 2031, 2033, 2124, & 2131. When the power Rocker is flipped on, IC6004 receive +5V_EVER directly from the PSU. It produces /SYSCON_RST automatically. An error immediately after SYSCON reset will beep the moment you flip that rocker or plug in the console!
  • IC6005/6 are powered by +5V_EVER and produce +3.3V_EVER and 1.8V_EVER respectively. These are the voltages that Power the SYSCON chip. If either of those IC's go bad, the SYSCON cannot do anything. You won't even get a standby LED. I'll look like the PSU is dead, when it isn't.
  • SYSCON Reset serves as enable for IC6009, which forms +3.3V_THERMAL for CPU, RSX, and SB Thermal Monitors. If they are bad at this step you get errors 2030, 2031, & 2033 errors respectivly at step number A0.
  • If IC6009 is bad, it will probably cause 2030 because the CELL Thermal monitor is the first in line to be checked.
I have a lot more, but I'll save it for "Power ON Topology Part 3."

Common Error combinations:

3034/4401
8x Consoles total:
· 2x consoles had the same BitTraining error (BE:RRAC:RX0:GLOBAL1:RX_STATUS). Others didn't post bringup.
· 1x had 1001's leading up to this 3034/4401.
· 1x had delid damage to CPU traces.
· 1x was probed with an oscilloscope and confirmed to have bad CPU tokins, but the BGA defect was the immediate issue. He reballed the RSX, and it didn't fix the console. The data error changed to 4411 though.

3034/4402

11x Consoles total:
· 2x RSX:RRAC:RX0:GLOBAL1:RX_STATUS
o 1x Reflowed RSX and was last reported working.
· 1x "RSX Nec/tokins are 3.0 Ohms, while CELL are 14.0 Ohms." That resistance on Cell is very high. Could be an open fault. Attempted a reflow that did not change the 40 3034 but the associated data error changed to 40 4401 with BitTraining BE:RRAC:RX0:GLOBAL1:RX_STATUS. It's unlikely he actually reflowed the solder.
· 1x was reballed (both RSX & CELL) to GLOD. No errors. RSX was moved to known working board and did the same. Issue is suspected to be dead RSX VRAM, but that is not related to the original 3034/4402. CPU/GPU reball fixed it, the RSX was just dead.
· 1x @vyktomvmpay25 Reballed both CPU/RSX to a GLOD. Concluded it was bad RSX RAM. Replaced RSX to fix. Posted the resistance measurements which showed that VDDQ (VRAM) was okay. PLL_VDD should be reading in the Mega Ohms for a 40nm RSX (off the board). So that's probably bad. VDDC is a bit low. VDDR shouldn't be 640K, but that could be a typo. 0.640k would be about right.
o RSX_PLL_VDD = 10.41kΩ
o RSX_VDDC = 1.6 – 1.8Ω
o FBVDDQ = 200Ω
o VDDIO = 938kΩ
o VDDR = 640kΩ
o YC_RC_VDDA = 2,940kΩ
o YC_RC_VDDIO = 9.28 kΩ
· @vyktomvmpay25 Reballed both CPU/RSX to a GLOD. Again concluded it was RSX VRAM, but this time didn't post resistance measurements. He's calling this situation a "Special GLOD."
· @vyktomvmpay25 Reballed RSX and replaced tokins (it had 1001/1002 also). It became GLOD again. Didn't go further.
· BitTraining RSX:RRAC:RX0:GLOBAL1:RX_STATUS
· 1x Had melted plastic around vents due to hair dryer trick (probably).

3034/4411

Not much known (Potentially Dead RSX)
3x Consoles total:
· 1x GLOD Noted, no resolution.

3034/4412

9x Consoles total:
· 1x RSX:RRAC:RX1:GLOBAL1:RX_STATUS
· 1x RSX:RRAC:RX0:GLOBAL1:RX_STATUS
· 1x Reflowed RSX, worked.
· 1x Pressure test didn't work (inconclusive)
· 1x had a long string of A0802203's followed by a long string of A0801802's leading up to the 3034/4412.
· 1x had a string of A0203010 leading up to the 3034/4412.

3034/4421

9x Consoles total:
· 1x Successfully reflowed/reballed
· 1x showed A0801200s leading up to 801701 / 801601 & 8014FF, then 3034/4421 thereafter.
· 1x RSX:RRAC:RX0:GLOBAL1:RX_STATUS. Reflowed, reported success.
· 1x occurred 1yr after rsx delid (perhaps not gluing IHS back on led to BGA defect. Lasted 1 year?)

3034/4421

5x Consoles total:
· 1x Sealed. Previous errors were A0801601, A0801701, A0801802, A0801004, A08014FF. Reballed to A0801701. Replaced CXM4024R, no change. Reballing CELL yielded A0801001, A08014FF. Diagnosed RSX – bump / die failure.
· 1x Heat test confirmed RSX BGA defect.

3034/4432

7x Consoles total:
· 3x BitTraining RSX:RRAC:RX3:GLOBAL1:RX_STATUS
· 1x Reflowed to a working state. Previously attempted replacing tokins, which didn't help. YLOD returned 2 months later, despite aggressive fan curves, with 801701, 801601, 801802 because PWR was on at the time the BGA/Bump failed, then 403034/404402 thereafter.
· 1x Pressure test didn't work (inconclusive). Reflow worked.

2120/3013

Here is a shortlist of the issues preceding A0202120/A0213013 errors
· Failed Reflow
· Replacing Tokins on a console previously experiencing A0403034
· Reballing RSX (only)
· Bad F6302 and C6320
· Visible trace Damage on CPU caused A0213013 by itself
· 0.0V at JL6354-JL6361 (DC/DC converters for AV backend)
NOTES:
· On page 13 of the SYSCON thread @chiefhunnablunts accidentally overwrote the eeprom at address 3961 from FF to 00. It resulted in A0202120/A0213013 error. When he changed it back, the error disappeared. Why would changing the bit there cause errors related to YC_RC_VDDIO? If it's related to SYSCON/CELL communication over SPI, then that explains it. It's also possible there was a precarious BGA defect that responded to mounting pressure between tests, IDK.
· On page 179 of the Tokin thread @TwelveAtNight fixed A0202120/A0203013 errors by replacing F6302 and C6320.
· On page 25 of the SYSCON thread @patricksouza472 fixed A0202120/A0213013 by replacing F6302 and C6320. Note, this console produced 10x 2120 to every 1x 3013. @Aran3a noted the same thing on page 45, but we never thought to try these SMDs. @moptop219 on page 218 of the tokins thread noted the same. There's a good chance this is what's wrong with these consoles.
· On page 25 of the SYSCON thread @db260179 fixed A0002120 by replacing TH2501.
· On page 21 of the SYSCON thread @nyislander had an A0213013. He checked voltages and found 0.0V at the JL6354-JL6361 points in the schematic, and confirmed +3.3V_MISC & +5V_MISC are present. Lack of voltage on the DC/DC converters downstream of IC6301 suggests there could have been blown fuses (F6301/2).

Hypothesis

3013 = CPU side of YC_RC_VDDIO
Evidence (strong): Correlative with few confounding variables
1. @Kleon1876 on page 36 of the SYSCON thread had A0213013 after CPU trace damage from a failed delid attempt.
2. @poot36 on page 106 of the SYSCON thread had A0202120/A0213013 when his CPU interposer was cracked in half by a failed delid attempt.
3. On page 21 of the SYSCON thread @nyislander had a A0213013. Confirmed lack of voltage on the DC/DC converters downstream of IC6301, which includes +1.2V_YC_RC_VDDIO.
2120 = RSX Side of YC_RC_VDDIO
Evidence (weak): Anecdotal and Associative, with many confounding variables.
1. @Bbowes on page 82 of the SYSCON thread had an A0313032 caused by knocking R5167 during a failed delid attempt. This console was working before the delid attempt. R5167 is the +1.2V_YC_RC_VDDIO reference voltage for the CPU's Redwood FlexIO ADC differential reference clock pair (BE_RC_REFCLK_P). The voltage was not knocked out altogether, it was selectively knocked out on a specific reference clock after IC5004. He replaced the resistor and got A0402101 / A0403034 because RSX TX1 was shorted to ground by a nicked trace during the RSX delid (incredible luck). TX is the transmit line, so the CPU will note the error because it sees the short (BitTraining BE:RRAC:BX0:BX:FLEXIO_ID). He messed with the nick and the error changed to A0313031. This shows that issues with the BE side of the clock generator's +1.2V_YC_RC_VDDIO reference voltage do not cause A0202120 and register in step number 31 (when the SYSCON checks clocks). So step numbers 20/21 pass their checks (Initialize CPU, RSX, & AV backend). The voltage was good up to at least the DC/DC converters.
2. @feng_ye on page 103 of the SYSCON thread had a GLOD A0802120 in which the HDMI transmitter was not being set up correctly. This was after probing and fixing numerous blown fuses. So we can be sure they were good at this point in the repair. He pressed on the corner of the RSX above VDDIO (He may have been referring to +1.5V_RSX_VDDIO instead of +1.2V_YC_RC_VDDIO. I'm still not sure how or if the two are related.) HDMI transmitter reset correctly, the 2120 disappeared, console booted. This confirmed a BGA defect affecting those balls can cause 2120 errors and a "Special" GLOD.
3. @MicrowaveEgg on page 90 of the SYSCON thread posted an Errorlog showing a history of A0801001 leading up to an A0403034. Then it started giving A0231002/A0902120. GLOD after a BGA defect. Strange, but it's been known to happen, depending on how the balls reconnect thermomechanically from whatever repair attempt was made. He said, "PS3 was never opened." So there wasn't a reflow/reball. Later said he "recapped" and got a bunch of A0202120's, followed by one A0213013. And a bunch of A0231002's. My guess is there is both a BGA defect and a potential fuse issue. The timeline of events is suspect too, IDK how trustworthy his recollection of the errorlog and/or method of dumping it is. I marked this one in the BGA/Bumps category, but it's not as straightforward as other consoles.
4. F6302 and downstream SMDs are involved in the formation of 1.2V_YC_RC_VDDIO. Also, BGA/Bump defects can affect both the GPU/CPU FlexIO pads for YC_RC_VDDIO. We see lots of 2120 errors related to BGA/Bump defects that are resolved by RSX reballs.
5. @squeept posted at or around page 20 of the SYSCON thread about a console with A0A02031 A0202120/A0213013. It was not sealed. Flux everywhere, severe warp, likely heatgun. Shorts present near encoder chip and a bad choke. Did not attempt to fix.
Conclusion:
2120/3013 errors are possible in BGA/bump defects and fuse/SMDs. Distinguishing them from one another comes down to the step number at which they occur.
80/90 = BGA/Bump defects and possibly voltage ripple/noise.
20/21 = BGA/Bump defects or fuses/SMDs.
00 = fuse.
Console needs troubleshooting. RSX and CPU Ohm tests and voltage test points need to be measured to build a picture of what's going on. Pay special attention to JL6354-JL6361, especially +1.2V_YC_RC_VDDIO. Check F6302 isn't blown. Check nearby caps for shorts.
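The conclusion above boils down to a small lookup by step number. A rough Python sketch of it follows; the table only restates the three lines of the conclusion, and the names (STEP_HINTS, hint_for) are made up for illustration. It is a starting point for troubleshooting, not a diagnosis.

```python
# Step-number hints for 2120/3013, copied from the conclusion above.
STEP_HINTS = {
    "80": "BGA/Bump defects and possibly voltage ripple/noise",
    "90": "BGA/Bump defects and possibly voltage ripple/noise",
    "20": "BGA/Bump defects or fuses/SMDs",
    "21": "BGA/Bump defects or fuses/SMDs",
    "00": "fuse",
}


def hint_for(entry):
    """Return a hint for a 2120/3013 entry like 'A0202120', else None."""
    text = entry.strip().upper()
    step, code = text[2:4], text[4:]  # assumes 'A0' + step + code layout
    if code not in ("2120", "3013"):
        return None  # this table only covers 2120/3013
    return STEP_HINTS.get(step, "no hint recorded for this step")


for entry in ("A0202120", "A0213013", "A0902120", "A0002120"):
    print(entry, "->", hint_for(entry))
```

Whatever the hint says, the measurements at JL6354-JL6361 and around F6302 still decide it.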

3010

6x consoles
· @Barteg, pg207, Tokin thread. A0203010 errors lead up to 3034/4412 (RSX RRAC BitTraining error = +1.2V_RSX_VDDR).
· @marciolsf, pg, Syscon thread. A0203010 while live probing pins 19-21 of IC6103. Those are the CPU Buck Controller's Gate pins, which send PWM to coordinate 3 Buck converters (3-phase). Bridging these pins would cause the timing to fail, and the voltage feedback error compensation would freak out, causing no PWR Good. Not sure if the SYSCON can tell if this was due to the Buck converters or not.
· @hrist, pg77, Syscon thread. A0313031 & A0902120 after performing the "Eraser mod" (placing pressure on Underside hole of CPU to increase contact pressure with HS). Step# 31 = CPU initialization. Reflowed to errors A0202120/A0213010. Suspect balls didn't actually flow.
· @Bosstom, Pg53, SYSCON thread. Had A0801002 errors fixed by replacing Side B tokins (both CPU/RSX). Console failed again 6 months later. He replaced Side A this time, which led to A0203010.

I will edit this post to make improvements to the formatting and grammar. I will likely add new information about the error combos as I learn more. There's just too much to unpack for me to get it all in on the first go. So this is just the start.


Very thorough as always! In all your adventures, did you happen to collect bit training messages as well? If so, do you mind sharing them? Or are all of them part of this post already?

Also, while getting back into things, I went through your documents (as posted in your signature), but I did not see that one image with all the testing points you'd recommend. Do you still have that? I've been going through the threads but haven't found them.
 
Hi everyone, I got PS3 fat model ver COK-001 with light turning Green and Red. I got errlog from syscon. Errcode 0xa0092113 Clock Generator Error (IC5004), what does it mean? Does anyone know how to fix it? Thanks!
version
v1.1.3_k1
[mullion]$
>$ patchvereep
patchvereep
major:0x0001
minor:0x0001
patch:0x0003
revision:0x0003
[mullion]$
>$ patchcsum
patchcsum
r1 csum: [00030266] [018DB626] [90662679]
r2 csum: [000069C5] [0046B830] [5E535A06]
[mullion]$
>$ errlog
[SSM] state: 0600 -> 0000
[SSM] Error state is cleared.
(PowerOff State)
[SSM] state: 0000 -> 0101
Bringup Mode #0 (0xFF)
[SSM] ssmCb_OnStartingBePowOn() called.
[SSM] Bringup mode : syspm_stat=00000000/00000000
[POWSEQ] PowerSeq_Setup called.
[SSM] state: 0101 -> 0301
[SSM] PowSeq Fail : Detected !
[SSM] state: 0301 -> 0700
[POWSEQ] AV Backend Letup
[SSM] Shutdown mode : syspm_stat=00000000/00000000
[ERROR]: 0xa0092113
[POWSEQ] PowerSeq_Letup called.
[SSM] state: 0700 -> 0600
(PowerOff State) (Fatal)
errlog
ofst[ 16]:err_code:0xffffffff, clock:0xffffffff
ofst[ 20]:err_code:0xffffffff, clock:0xffffffff
ofst[ 24]:err_code:0xffffffff, clock:0xffffffff
ofst[ 28]:err_code:0xffffffff, clock:0xffffffff
ofst[ 32]:err_code:0xffffffff, clock:0xffffffff
ofst[ 36]:err_code:0xffffffff, clock:0xffffffff
ofst[ 40]:err_code:0xffffffff, clock:0xffffffff
ofst[ 44]:err_code:0xffffffff, clock:0xffffffff
ofst[ 48]:err_code:0xffffffff, clock:0xffffffff
ofst[ 52]:err_code:0xffffffff, clock:0xffffffff
ofst[ 56]:err_code:0xffffffff, clock:0xffffffff
ofst[ 60]:err_code:0xffffffff, clock:0xffffffff
ofst[ 64]:err_code:0xffffffff, clock:0xffffffff
ofst[ 68]:err_code:0xffffffff, clock:0xffffffff
ofst[ 72]:err_code:0xffffffff, clock:0xffffffff
ofst[ 76]:err_code:0xffffffff, clock:0xffffffff
ofst[ 80]:err_code:0xffffffff, clock:0xffffffff
ofst[ 84]:err_code:0xffffffff, clock:0xffffffff
ofst[ 88]:err_code:0xffffffff, clock:0xffffffff
ofst[ 92]:err_code:0xffffffff, clock:0xffffffff
ofst[ 96]:err_code:0xffffffff, clock:0xffffffff
ofst[100]:err_code:0xffffffff, clock:0xffffffff
ofst[104]:err_code:0xffffffff, clock:0xffffffff
ofst[108]:err_code:0xffffffff, clock:0xffffffff
ofst[112]:err_code:0xffffffff, clock:0xffffffff
ofst[116]:err_code:0xffffffff, clock:0xffffffff
ofst[120]:err_code:0xffffffff, clock:0xffffffff
ofst[124]:err_code:0xffffffff, clock:0xffffffff
ofst[ 0]:err_code:0xa0092113, clock:0xffffffff
ofst[ 4]:err_code:0xa0092113, clock:0xffffffff
ofst[ 8]:err_code:0xa0092113, clock:0xffffffff
ofst[ 12]:err_code:0xa0092113, clock:0xffffffff
[mullion]$
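Most of a dump like the one above is empty slots, so it can help to filter it before reading. Below is a small Python sketch that pulls the offset, error code, and clock out of each "ofst[...]" line. Treating 0xffffffff as an unused slot is my assumption from how the log looks, and the function name is hypothetical.

```python
import re

# Matches lines such as: ofst[ 16]:err_code:0xffffffff, clock:0xffffffff
ENTRY = re.compile(
    r"ofst\[\s*(\d+)\]:err_code:0x([0-9a-fA-F]{8}), clock:0x([0-9a-fA-F]{8})"
)


def parse_errlog(text):
    """Return (offset, code, clock) tuples, skipping slots that read ffffffff."""
    entries = []
    for offset, code, clock in ENTRY.findall(text):
        if code.lower() == "ffffffff":
            continue  # assumed to be an empty slot
        entries.append((int(offset), code.upper(), clock.upper()))
    return entries


sample = """\
ofst[124]:err_code:0xffffffff, clock:0xffffffff
ofst[ 0]:err_code:0xa0092113, clock:0xffffffff
ofst[ 4]:err_code:0xa0092113, clock:0xffffffff
"""

for offset, code, clock in parse_errlog(sample):
    print(f"offset {offset:3d}: {code} (clock {clock})")
```

Copy/pasting the raw dump into something like this also avoids the transcription typos that creep in when codes are retyped by hand.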
 
Very thorough as always! In all your adventures, did you happen to collect bit training messages as well? If so, do you mind sharing them? Or are all of them part of this post already?

Also, while getting back into things, I went through your documents (as posted in your signature), but I did not see that one image with all the testing points you'd recommend. Do you still have that? I've been going through the threads but haven't found them.
If you don't find anything helpful in my link, please quote me and say exactly what you are looking to test.
http://s.go.ro/ax49drsu
 
I soldered a refurbished RSX onto the board and it ended up with instant 3 beeps.

ERRLOG:

>$ errlog
00000000
# CODE CLOCK

New (refurbished) RSX soldered in:
# A0093004 FFFFFFFF
# A0093004 FFFFFFFF
# A0093004 FFFFFFFF
# A0093004 FFFFFFFF
# A0093004 FFFFFFFF
# A0093004 FFFFFFFF
# A0093004 FFFFFFFF
# A0093004 FFFFFFFF
# A0093004 FFFFFFFF
# A0093004 FFFFFFFF

Original RSX before swap:
# A0801601 0B49D885
# A0801701 0B49D884
# A0801601 0B49D84C
# A0801701 0B49D84B
# A0801601 0B49D818
# A0801701 0B49D817


This is interesting, as the A0093004 is marked as Power Fail,
A0093004 = RSX_POW_FAIL poweroff state

This console already has 1 tokin replaced with 4 tantals.
Console starts, beeps 3 times, and dies.

I can retry and use my last good RSX that I have in stock,
or... replace all tokins with tantals...

What do you guys think?
 
Very thorough as always! In all your adventures, did you happen to collect bit training messages as well? If so, do you mind sharing them? Or are all of them part of this post already?
Yes, but it's hard to say if there's correlation.
BitTraining Error | Error Combo
BE:RRAC:RX0:GLOBAL1:RX_STATUS | A0403034
BE:RRAC:RX4:GLOBAL1:RX_STATUS | A0403034
RSX:RRAC:BX0:BX:FLEXIO_ID | A0403034
RSX:RRAC:BX0:BX:FLEXIO_ID | A0403034 A0902120
RSX:RRAC:BX0:BX:FLEXIO_ID | A0403034 A0902120
RSX:RRAC:RX0:GLOBAL1:RX_STATUS | A0403034 A0902120
BE:RRAC:BX0:BX:FLEXIO_ID | A0403034/A0402101
BE:RRAC:RX0:GLOBAL1:RX_STATUS | A0403034/A0404401
BE:RRAC:RX0:GLOBAL1:RX_STATUS | A0403034/A0404401
RSX:RRAC:RX0:GLOBAL1:RX_STATUS | A0403034/A0404402
RSX:RRAC:RX0:GLOBAL1:RX_STATUS | A0403034/A0404402
RSX:RRAC:RX0:GLOBAL1:RX_STATUS | A0403034/A0404412
RSX:RRAC:RX1:GLOBAL1:RX_STATUS | A0403034/A0404412
RSX:RRAC:RX0:GLOBAL1:RX_STATUS | A0403034/A0404421
RSX:RRAC:RX3:GLOBAL1:RX_STATUS | A0403034/A0404432
RSX:RRAC:RX3:GLOBAL1:RX_STATUS | A0403034/A0404432
RSX:RRAC:RX3:GLOBAL1:RX_STATUS | A0403034/A0404432
RSX:RRAC:RX3:GLOBAL1:RX_STATUS | A0403034/A0404432

One of the hypotheses I wanted to investigate was 'if certain DATA errors trigger the same BitTraining Error.' The answer is inconclusive.

For example, RSX:RRAC:RX0:GLOBAL1:RX_STATUS had 2 DATA errors associated with it (4421 & 4412). I linked to the posts to make this point. I posted my errorlog, @Lynx didn't. He just related the code. He could have transposed the 12 and written it as "21" accidentally, but I can't make that assumption. I have to record the data as it's written on the forum. It's the only other BitTraining error in the data set that matches mine. In a perfect world people would copy/paste their errorlog, removing an opportunity for typos to creep in, but this is the dirty nature of forum data collection. Perhaps our codes and BitTraining errors do match, but unless he still has the logs and wants to post an update, we'll never know.
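For anyone who wants to check the hypothesis against more data as it comes in, here is a minimal Python sketch of the cross-tabulation. It uses only the rows from the list that have a 44xx DATA error, keyed by the last four digits. The variable and function names are made up for illustration.

```python
from collections import defaultdict

# (BitTraining message, DATA error) pairs taken from the 44xx rows of the list.
ROWS = [
    ("BE:RRAC:RX0:GLOBAL1:RX_STATUS", "4401"),
    ("BE:RRAC:RX0:GLOBAL1:RX_STATUS", "4401"),
    ("RSX:RRAC:RX0:GLOBAL1:RX_STATUS", "4402"),
    ("RSX:RRAC:RX0:GLOBAL1:RX_STATUS", "4402"),
    ("RSX:RRAC:RX0:GLOBAL1:RX_STATUS", "4412"),
    ("RSX:RRAC:RX1:GLOBAL1:RX_STATUS", "4412"),
    ("RSX:RRAC:RX0:GLOBAL1:RX_STATUS", "4421"),
    ("RSX:RRAC:RX3:GLOBAL1:RX_STATUS", "4432"),
    ("RSX:RRAC:RX3:GLOBAL1:RX_STATUS", "4432"),
    ("RSX:RRAC:RX3:GLOBAL1:RX_STATUS", "4432"),
    ("RSX:RRAC:RX3:GLOBAL1:RX_STATUS", "4432"),
]


def cross_tabulate(rows):
    """Map each DATA error to its messages, and each message to its DATA errors."""
    by_data = defaultdict(set)
    by_message = defaultdict(set)
    for message, data_error in rows:
        by_data[data_error].add(message)
        by_message[message].add(data_error)
    return by_data, by_message


by_data, by_message = cross_tabulate(ROWS)
for data_error in sorted(by_data):
    print(data_error, "->", sorted(by_data[data_error]))
```

A DATA error that maps to more than one message, or a message that maps to more than one DATA error, is a counterexample to a strict one-to-one relationship, subject to the typo caveat above.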
Also, while getting back into things, I went through your documents (as posted in your signature), but I did not see that one image with all the testing points you'd recommend. Do you still have that? I've been going through the threads but haven't found them.
Which one?

In "Power Control Topology Part 2" I posted the voltage tree and MB Test Points. I never posted it, but I made a voltage worksheet with all the jumper lead testpoints nested into their functional group. People may find it useful.
Jumper | IC/SMD | Voltage OR Nominal Signal State (High/Low) | Measurement & Notes

Power Supply & DC/DC Converters
JL9623 | PSU | ACDC Standby = High |
PSU | CN6004 | +12V_MAIN |
JL9621 | CN6005 | +5V_EVER |
JL2452 | IC2403 | +5V_ANA |
JL9644 | IC6003 Vout 1 | +5V_MISC |
JL6052 | IC6005 | +3.3V_EVER |
JL6051 | IC6006 | +1.8V_EVER |
JL9645 | IC6003 Vout 2 | +3.3V_MISC |
JL9646 | Q6305 | +1.7V_MISC |
JL6053 | IC6020 | +3.3V_MK_VDD |
JL6054 | IC6013 | +2.5V_LREG_XCG_500_MEM |

CPU (CELL Broadband Engine)
JL6063 | IC6012 | +1.2V_MC2_VDDIO |
JL6103 | IC6103 | /BE_POW_FAIL = High |
JL6104 | IC6103 | ENABLE = High |
JL6109 | IC6104 | +1.0V_BE_VDDC (CS Input) |
JL6110 | IC6104 | +1.0V_BE_VDDC (CS Output) |
JL6111 | IC6105 | +1.0V_BE_VDDC (CS Input) |
JL6112 | IC6105 | +1.0V_BE_VDDC (CS Output) |
JL6113 | IC6106 | +1.0V_BE_VDDC (CS Input) |
JL6114 | IC6106 | +1.0V_BE_VDDC (CS Output) |
JL9653 | IC6103 | +1.0V_BE_VDDS1 |
JL9654 | IC6007 | +1.6V_BE_VDDA |
JL9655 | IC6304 | +1.5V_YC_RC_VDDA |

GPU (RSX - Reality Synthesizer)
JL6048 | IC6008 | +1.8V_RSX_PLL_VDD |
JL6056 | IC6019 | +1.5_AVCG_VDDIO |
JL6200 | IC6201 | ENABLE = High |
JL6201 | IC6201 | /RSX_POW_FAIL = HIGH |
JL6205 | IC6302 | +1.2V_RSX_VDDC (CS Input) |
JL6206 | IC6302 | +1.2V_RSX_VDDC (CS Output) |
JL6207 | IC6203 | +1.2V_RSX_VDDC (CS Input) |
JL6208 | IC6203 | +1.2V_RSX_VDDC (CS Output) |
JL9650 | IC6201 | +1.2V_RSX_VDDMONI |
JL9656 | IC6017 | +1.5V_RSX_VDDIO |
JL9657 | Q6304 | +1.8V_RSX_FBVDDQ |

South Bridge & Peripherals
JL6058 | Q6004 | +3.3V_SB_VDDIO |
JL6060 | IC6011 | +2.5V_SB_PLL_VDDC |
JL6061 | IC6014 | +1.8V_SB_PERI |
JL6064 | Q6010 | +12V_BD |
JL9658 | Q6010 | +5V_BD |
JL6067 | Q6007 | +5V_HDD |
JL6057 | Q6008 | +5V_USB |
JL3551 | Q3503 | +3.3V_ESW |
JL3502 | IC3502 | +1.9V_ESW |
JL9649 | Q3501 | +1.2V_ESW |
Note:
VDDC CS input/output are for Current Sensing. There is a shunt resistor (0.001 Ohm) between the inductor and the tokin array, one for each buck regulator. On Side B of the motherboard there are 2 test points, on the input and output sides of that shunt resistor. By measuring the voltage drop across it (Delta V) you can calculate the current being supplied by each buck converter using Ohm's law (I = Delta V / R).

If you hook up 4 oscilloscope probes to these 4 test points (2-phase RSX VRM), you can set up the scope to do the math and output the difference (Ch1-Ch2 = Delta V1, Ch3-Ch4 = Delta V2). If you then capture the waveform you can calculate the current being supplied. It may be useful for diagnostics, so I added a row for them, but it's not super relevant for our purposes, unless you want a more accurate way to directly measure power consumption than a Kill-A-Watt, or want to calculate it for the RSX/CPU independently.
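The arithmetic for the shunt measurement is just Ohm's law per phase, summed across phases. A short Python sketch follows; the voltage readings in it are invented placeholders rather than measurements from a real console, and the function names are hypothetical.

```python
SHUNT_OHMS = 0.001  # shunt resistor value given in the note above


def phase_current(v_cs_input, v_cs_output, shunt_ohms=SHUNT_OHMS):
    """Current through one buck phase: I = Delta V / R."""
    return (v_cs_input - v_cs_output) / shunt_ohms


def total_current(phase_readings, shunt_ohms=SHUNT_OHMS):
    """Sum the per-phase currents for a multi-phase VRM."""
    return sum(phase_current(v_in, v_out, shunt_ohms)
               for v_in, v_out in phase_readings)


# Hypothetical readings (volts) for a 2-phase VRM: (CS Input, CS Output).
readings = [(1.215, 1.200), (1.214, 1.200)]
amps = total_current(readings)
watts = amps * 1.200  # rough power at the rail, using the output voltage
print(f"total current ~ {amps:.1f} A, power ~ {watts:.1f} W")
```

Because the shunt is only 1 milliohm, every millivolt of drop is an amp, so probe offset and noise matter a lot. That is why the scope's differential math (Ch1-Ch2) is the better way to take the reading.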
 
I posted the voltage tree and MB Test Points.
That's what I was looking for! The motherboard with the annotated test points, thanks. But the spreadsheet is nice, it's more organized than the notebook I tend to use.

Yes, but it's hard to say if there's correlation.

Looking at the schematics and docs I read, FlexIO is both a protocol of sorts (implemented at the chip level), and traces that connect rsx straight to cell. All Flexio really seems to do is provide a way for rsx to have direct communication with cell and offload some processing to it. In that context, bit training makes total sense.

IIRC, all the 3034's I've seen so far are bit training errors, and all the bit training messages I've seen are on the Flexio "area" at the edge of package. If this trend is accurate, this might be a useful clue (even if reball is still the only fix for now)
 
That's what I was looking for! The motherboard with the annotated test points, thanks. But the spreadsheet is nice, it's more organized than the notebook I tend to use.
For me, having the IC listed right next to the voltage is the killer feature. It allows me to quickly find the relevant part of the service manual for troubleshooting - identifying SMD's to check for shorts.
Looking at the schematics and docs I read, FlexIO is both a protocol of sorts (implemented at the chip level), and traces that connect rsx straight to cell. All Flexio really seems to do is provide a way for rsx to have direct communication with cell and offload some processing to it. In that context, bit training makes total sense.

IIRC, all the 3034's I've seen so far are bit training errors, and all the bit training messages I've seen are on the Flexio "area" at the edge of package. If this trend is accurate, this might be a useful clue (even if reball is still the only fix for now)
I meant the same DATA error (eg, 4412) doesn't cause the same BitTraining error every time. BitTraining errors do seem to refer to the FlexIO a lot. Which doesn't surprise me, since those BGA pads are closest to the CPU. The combined heat from both the CPU/RSX travels through the motherboard centered around the FlexIO. So those balls are closest to the highest Delta T. Strain is greatest there.

As I previously mentioned, the RSX is lacking stiffening. And that would be a good place for a spacer to be glued to for reinforcement. It would make delidding impossible, but I think it's a valid approach to increasing the 90nm RSX's reliability. I'm thinking of designing a replacement IHS that contacts the surface there. It would need to be glued on as a complete replacement IHS. The glue is important to reduce package warping. That would work for the FlexIO, but the SMD's on the other 3 sides physically block this approach to stiffening the interposer above VDDIO and VDDC.

BTW, @squeept offers a 1 year warranty on his reballs and always glues the IHS back on! Ever wonder why he spends money on glue? Now you know!
 
Yes, but it's hard to say if there's correlation.

Not sure if this is relevant, but the 3034 combos with bittraining errors... I'd imagine the data errors are a bit random. You can also check the order in which bit training is happening here.

https://www.psdevwiki.com/ps3/Rambus_Registers#

So whichever step it is stuck on, then that contact is broken? You can probably trace the exact pad. For instance when I had issues with the boards flexing, the bga wasn't soldered right. Some contacts were lost. So I got errors like

ofst[ 88]:err_code:0xa0404411, clock:0xffffffff
ofst[ 92]:err_code:0xa0403034, clock:0xffffffff
[POWERSEQ] Error : BitTraining BE:RRAC:RX1:GLOBAL1:RX_STATUS

Then a second time I resoldered it I didn't save the precise data error, but it was a different contact broken this time. So I got stuck at : BE:RRAC:RX2:GLOBAL1:RX_STATUS. (So it went a bit further but still no dice lol). I eventually got it to work.. So I know for sure it wasn't a dead rsx, but a true bga problem. Now I realize I should've saved the results of my numerous experiments and various errors that it produced. I didn't think much about it at the time other than trying to get the boards to boot.
 
You can probably trace the exact pad.

I did! But what I found wasn't super exciting... It's pretty much a straight-through connection. This is an overlay of bottom and top halves of the board, and I highlighted all of RX0 in blue. The trace in red corresponds to error BitTraining BE:RRAC:RX0:GLOBAL1:RX_STATUS

fB5AE1c.jpg


So there's not a whole lot there... which can probably simplify the root cause. It might very well be what @RIP-Felix has been suggesting, with the heat between cell and rsx building up in that spot.

Given the large amount of vias in that area we can do some amount of probing, but to truly test the lines we'd need access to the pads themselves...
 
Hi everyone, I got PS3 fat model ver COK-001 with light turning Green and Red. I got errlog from syscon. Errcode 0xa0092113 Clock Generator Error (IC5004), what does it mean? Does anyone know how to fix it? Thanks!
This is actually a great opportunity to investigate a Hypothesis:

Only 1 other console had a 09 2113 error (until now).
  1. @Shawn Shakir, pg94. Had A0092113, A0202120, A0213013. "Console was a gamestop refurb but looks like I was the first person to repair it. Used my ACHI IR PRO, ended up using a replacement RSX from Aliexpress. Not sure if that would be the culprit or not." He is an experienced reballer. If the RSX was okay, which is a long shot, then the CPU BGA/Bumps could have been damaged by the heat. There is a clock generator error (IC5004) implicated by the 2113. IC5004 generates clocks and relies on the +1.2V_YC_RC_VDDIO reference voltage to carry the signals. That can certainly be affected by RSX/CPU BGA defects. Another possibility is F6302, which supplies 1.7V_MISC to IC6303 to generate +1.2V_YC_RC_VDDIO, among other voltages required to start CPU/SB/GPU.
His console also had 20 2120 / 21 3013 errors in the log. Those errors are VERY suggestive of F6302 and nearby ICs. I suggest you start there and probe your way around IC6303. The 2113 is occurring at step number 09, which could indicate that there is an earlier voltage issue preventing your console from getting far enough into POST for the 202120/213013 errors to show up.

Probe voltages using the test points and worksheets I just posted. Also measure resistance at the following test points on the motherboard.
COK-001_MB_Ohm_Test_points.jpg


@vyktormvmpay25, @marciolsf, @sandungas, @db260179 (everyone really). I have some observations I need help with and think warrant further investigation.

Have you ever noticed FB2103? It's the one the arrow for RSX_VDDA in green is pointing at. YC_RC_VDDA powers the CPU/SB ADC. That ferrite bead and surrounding SMDs filter YC_RC_VDDA for the RSX's use, becoming RSX_VDDA afterward. If FB2103 blows there will be no RSX Redwood Rambus FlexIO ADC power (for analog IO like the HDMI/AV chips). The chips may still get power, but the RSX won't be able to translate the analog signals onto the digital interface so the CPU/SB can interact with them. I wonder if that could cause 2120 and 2124 errors? I would check these SMDs and voltages next time those errors pop up.

Likewise, I wonder what would happen if Q6310 goes bad? That's the XDR/CPU/SB Yellowstone XIO & Redwood Rambus FlexIO Core voltage. It supplies the CPU side of the reference voltage for the FlexIO interface's digital logic. The RSX has VDDR for its FlexIO digital Core voltage. They are separately powered, but communicate across the same interface. Anyway, I wonder if Q6310/IC6310 could cause 3013 or 3010 errors? Perhaps even 2120/3013 errors. Not sure, but they might even cause BitTraining errors if the voltage falls out of regulation.

This is pretty advanced, I know. At least, it is for me. It's why these Power Control Topology and SYSCON error analysis posts are taking so long to research and make sense of. I feel like I'm on the verge of understanding this, but it's eluding me.
 
Yes, but it's hard to say if there's correlation.
I was on mobile when I first saw your response, so I missed the bittraining list. Thanks for that!

Looking at the list, yes, the errors are all over the place. I'm assuming the prefix (either BE: or RSX:) indicates the source of the error and the end of the message (RX_STATUS) indicates the action. For example, BE:RRAC:RX0:GLOBAL1:RX_STATUS means "I can talk to RSX, but I can't hear back (RX)". Along the same lines, RSX:RRAC:RX0:GLOBAL1:RX_STATUS would be its opposite: "I can talk to BE, but I can't hear back".

What I'm not sure about is BE:RRAC:BX0:BX:FLEXIO_ID. If my convention is right, it says there's a problem with BE doing... something? So far I can't find any references to BX or to the meaning of FLEXIO_ID. If I were to make a guess, it has to do with BE being unable to start bittraining with RSX.
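
To make that guess easier to test against the full BitTraining list, here is a minimal Python sketch that splits a message into the pieces described above. It only encodes my assumed convention (first field = reporting chip, last field = register being checked); the function name and the dictionary keys are made up for illustration and don't come from any Sony or Rambus documentation.

```python
def split_bittraining(message):
    """Split a BitTraining message such as 'BE:RRAC:RX0:GLOBAL1:RX_STATUS'.

    Assumption (a forum guess, not a documented format): the first field is
    the chip reporting the problem and the last field is the register or
    action that failed. Everything in between is kept as an opaque path.
    """
    fields = message.strip().split(":")
    if len(fields) < 2:
        raise ValueError("expected at least SOURCE:REGISTER, got %r" % message)
    return {
        "source": fields[0],      # e.g. BE or RSX
        "path": fields[1:-1],     # e.g. ['RRAC', 'RX0', 'GLOBAL1']
        "register": fields[-1],   # e.g. RX_STATUS or FLEXIO_ID
    }


if __name__ == "__main__":
    for msg in ("BE:RRAC:RX0:GLOBAL1:RX_STATUS", "BE:RRAC:BX0:BX:FLEXIO_ID"):
        print(split_bittraining(msg))
```

Grouping reported messages by "source" and "register" would at least show whether the BE: and RSX: variants really come in mirrored pairs like the convention predicts.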

I think this makes sense... If syscon has to do "bittraining", it has to have a reference of what's "correct", and that's probably more than just getting a signal back. According to IBM's Cell document (bold highlights are mine):

In this document, the terms calibration and training are synonyms. The Cell BE FlexIO interfaces
perform two types of calibration at POR:

• Bit Calibration—This adjusts the bits within each 8-bit-wide Rambus channel for differences
in circuit, wiring, and loading delays between the multiple bits of the Rambus channel. Bit calibration
also calibrates the signal driver current and driver impedance, and it equalizes the
eight data bit timings to center the data eye around the clock edges.

• Byte Calibration—This equalizes the timings of the 8-bit channel groups that make up the
FlexIO interfaces (IOIF0 and IOIF1). Byte calibration also establishes the correct envelope
framing on the interface by detecting the location of the start-of-envelope pattern.
Only the FlexIO interfaces are calibrated during the POR sequence. The XIO memory interface is
calibrated during the firmware sequence, as described in Section 2.2.2.1 on page 62.

Timings are important, as well as impedance. Everything else leading up to it has to be just right.
 
Your answer is much appreciated. I will follow your instruction and report back. Thanks!
This is actually a great opportunity to investigate a hypothesis:

Only one other console has had a 09 2113 error (until now).
  1. @Shawn Shakir, pg94. Had A0092113, A0202120, A0213013. "Console was a gamestop refurb but looks like I was the first person to repair it. Used my ACHI IR PRO, ended up using a replacement RSX from Aliexpress. Not sure if that would be the culprit or not." He is an experienced reballer. If the RSX was okay, which is a long shot, then the CPU BGA/bumps could have been damaged by the heat. The 2113 implicates a clock generator (IC5004) error. IC5004 generates clocks and relies on the +1.2V_YC_RC_VDDIO reference voltage to carry the signals. That can certainly be affected by RSX/CPU BGA defects. Another possibility is F6302, which supplies 1.7V_MISC to IC6303 to generate +1.2V_YC_RC_VDDIO, among other voltages required to start the CPU/SB/GPU.
 
What I'm not sure about is BE:RRAC:BX0:BX:FLEXIO_ID. If my convention is right, it says there's a problem with BE doing... something? So far I can't find any references to BX or to the meaning of FLEXIO_ID. If I were to make a guess, it has to do with BE being unable to start bittraining with RSX.

FLEXIO_ID. That's the first step of rambus registers in the list.

At least when I installed a 40nm on the board without the orbis mod (resistors were changed), it produced BitTraining RSX:RRAC:BX0:BX:FLEXIO_ID, which makes sense. It can't recognize the GPU, so it can't start the bittraining. But ofc, it might just as well happen when the responsible contact is broken.
 
This is interesting, as the A0093004 is marked as Power Fail,
A0093004 = RSX_POW_FAIL poweroff state

This console already has one NEC/TOKIN replaced with four tantalum capacitors.
The console starts, beeps three times, and dies.

I can retry and use my last good RSX that I have in stock,
or... replace all tokins with tantals...

What do you guys think?

Never mind...
Looks like one of the RSXs I bought died during the process...

I replaced it with another one, and the console now works.

1200 is an overheating issue (BE Thermal Error) - pls ignore, as I did a quick test without applying the thermal compound on the IHS.


>$ auth
Auth successful
>$ errlog
00000000
# CODE CLOCK

Refurbished RSX#1:
# A0093004 FFFFFFFF
# A0093004 FFFFFFFF

Original RSX:
# A0801601 0B49D885
# A0801701 0B49D884
# A0801601 0B49D84C
# A0801701 0B49D84B

>$ powerstate
00000000
# ATA :OFF
# PCI :OFF
# PCIex:OFF
# RSX :OFF
# GDDR :OFF
# XDR :OFF
# EURUS:OFF
# SB :OFF

>$ bringup
00000000
# [SSM] Bringup Start.
# [SSM] PS0 ok.
# [SSM] PS1 ok.
# [SSM] PS2 ok.
# [SSM] PS3 ok.
# [SSM] PS4 ok.
# (PowerOn State)
OK 00000000
#!
#!Boot Loader SE Version 2.5.0
#!(Build ID: 3318,35708,
#!Build Date: 2008-10-11_00:31:58)
#!
#!Copyright(C) 2008 Sony Computer Entertainment I
#!
#![INFO]: Connecting to Debug Device (SB UART)
# [UCMD] Unknown command.

>$ errlog
00000000
# CODE CLOCK

Refurbished RSX#2:
# A0801200 0B49D850
# A0801200 0B49D824

# A0093004 FFFFFFFF
# A0093004 FFFFFFFF
# A0093004 FFFFFFFF
# A0093004 FFFFFFFF
# A0093004 FFFFFFFF
# A0093004 FFFFFFFF
# A0093004 FFFFFFFF
# A0093004 FFFFFFFF
# A0093004 FFFFFFFF
# A0093004 FFFFFFFF

# A0801601 0B49D885
# A0801701 0B49D884
# A0801601 0B49D84C
# A0801701 0B49D84B
# A0801601 0B49D818
# A0801701 0B49D817


>$ shutdown
00000000
# [SSM] Shutdown Start.
# [SSM] Shutdown ok.
# (PowerOff State)



And yet... same thing.
Component is working, and HDMI is not working.
I will do some more troubleshooting tomorrow.

I will test the pinout for Panasonic mn864709 based on
https://www.psdevwiki.com/ps3/MN8647091
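
For anyone collating logs like the one above, here is a rough Python sketch of how an errlog line can be split into the step number and the error code, following the way codes are read in this thread (A0093004 = step 09, code 3004). The regular expression, the function name and the tiny lookup table are my own assumptions for illustration; the table only holds the two meanings stated in this post, and the CLOCK column is kept as raw hex without interpretation.

```python
import re

# Assumed line shape, taken from the errlog output in this thread:
#   "# A0SSCCCC TTTTTTTT"  ->  SS = step number, CCCC = error code, T = clock
ENTRY = re.compile(r"^#?\s*A0([0-9A-F]{2})([0-9A-F]{4})\s+([0-9A-F]{8})$",
                   re.IGNORECASE)

# Only the meanings mentioned in this post; everything else stays "unknown".
KNOWN_CODES = {
    "3004": "RSX_POW_FAIL",
    "1200": "BE Thermal Error",
}


def parse_errlog(text):
    """Return a list of dicts for every errlog entry found in pasted text."""
    entries = []
    for line in text.splitlines():
        match = ENTRY.match(line.strip())
        if not match:
            continue  # prompts, headers, blank lines, notes
        step, code, clock = (part.upper() for part in match.groups())
        entries.append({
            "step": step,
            "code": code,
            "clock": clock,
            "meaning": KNOWN_CODES.get(code, "unknown"),
        })
    return entries


if __name__ == "__main__":
    sample = """>$ errlog
00000000
# CODE CLOCK
# A0093004 FFFFFFFF
# A0801200 0B49D850
"""
    for entry in parse_errlog(sample):
        print(entry)
```

Sorting the parsed entries by "step" makes it easier to spot which error happens earliest in the power-on sequence, which is usually the one worth chasing first.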
 
Could you ohm test the locations I posted above? We need a working control though. We have a decent set of measurements on the RSX off the board. Now we need the same, but installed on the MB. This way we can narrow down the possibilities to the circuit where the fault is located.

BTW, are you still talking about that VER console?
 
We had a bit of a discussion recently about livelocks (I don't remember which thread, at this point it's all a jumble in my head). I found an IBM document about Cell (the 65nm model, in particular), in which they detail livelock resolution. From what I gather, Cell will tell syscon "hey, there's some blocking here!", and then leave it up to the syscon to address it. In most database engines, a similar condition is called a "deadlock", and it generally ends with the engine killing off one of the conflicting sessions. I know this is not a database, but the principles seem similar.

This particular line is interesting

The ATTENTION signal is asserted when the Cell BE processor detects a livelock condition. This
signal is the same as the ATTENTION signal used during the power-on reset sequence.

So maybe this is backing up your idea earlier, @RIP-Felix, that a BGA breakdown during operation throws an error (I forget which one you suggested), and then it throws a 3034, which is the same ATTENTION signal that is designed to prevent a full boot.

Appendix C. Livelock Resolution Mode
Livelocks occur when one or more units in a processor element cannot make forward progress.
The Cell Broadband Engine (Cell BE) processor contains several internal mechanisms to avoid
livelock. One example is the pseudorandom retry backoff mechanism. Although this mechanism
eliminates most livelocks, some can still occur. In addition to the internal mechanisms, the
Cell BE processor provides an external notification to the system controller when a livelock is
detected and is not resolved. The notification is in the form of an ATTENTION signal to the
system controller. In response to the ATTENTION signal, the system controller can further alter
the operation by enabling the livelock resolution mode, which alters the operation of the
processor and typically resolves the livelock.
Like livelocks, rare cases exist in which a processing element is making forward progress, but at
an extremely slow rate. This is referred to as starvation. The internal mechanisms are typically
sufficient to prevent starvation. Starvation is not detected by the Cell BE processor. A system
controller can, however, randomly and at a very slow rate, enable and then disable the livelock
resolution mode to resolve any condition causing starvation.
For performance reasons, the system controller should never leave the Cell BE processor in the
livelock resolution mode for an extended period of time. The system should provide a mechanism
for the system controller to notify the operating system that a livelock was detected and resolved.


C.1 System Controller Actions
The ATTENTION signal is asserted when the Cell BE processor detects a livelock condition. This
signal is the same as the ATTENTION signal used during the power-on reset sequence. Other
conditions can also cause the ATTENTION signal to be asserted. The system controller should
monitor the ATTENTION signal using either polling or interrupts. When the ATTENTION signal is
asserted, the system controller should read the Read SPI Status Register (rd_spi_status). If
rd_spi_status[0,7] are both set, the Cell BE processor has detected a livelock condition. The
system controller should then perform the following sequence:
1. Write wr_spi_status[18] = '1' to throttle the PowerPC Processor Element (PPE).
2. Write wr_spi_status[16,17,19] = '1' to quiesce transactions and enable the livelock resolution
mode:
• wr_spi_status[16] = '1' quiesces memory flow controller (MFC) bus transactions.
• wr_spi_status[17] = '1' quiesces PPE bus transactions.
• wr_spi_status[19] = '1' enables livelock resolution mode.
3. Write wr_spi_status[18] = '0' to stop throttling the PPE. (If this step is not performed, the
Cell BE processor will not resolve the livelock.)
4. Write wr_spi_status[4] = '1' to reset and resample the livelock condition by deactivating the
ATTENTION signal.
5. Read rd_spi_status[7] to determine if the livelock is resolved. This bit will return '0' if the livelock
is resolved, or it will return '1' if the livelock is not resolved.
6. If the livelock is resolved, perform these next steps:
a. Write wr_spi_status[16,17,19] = '0' to remove the quiesce for the MFC and PPE bus
transactions, and to disable the livelock resolution mode.
b. Notify the operating system to indicate that a livelock has been detected and resolved.
7. If the livelock is not resolved, then the system controller should assert the CHECKSTOP_IN
signal and perform any system-dependent operations for reporting a checkstop condition.
Optionally, the system controller can perform steps 1 through 6a at random intervals to resolve
any potential starvation conditions. Steps 1 through 6 should be performed sequentially, and the
intervals between performing these steps should be randomly spaced
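
To make the sequence in C.1 easier to follow, here is a minimal Python sketch of steps 1 through 7 as a system controller might run them. It is only an illustration of the quoted text: the callables `read_bit`, `write_bits`, `notify_os` and `assert_checkstop`, and the `FakeCell` stand-in, are hypothetical names of mine. Nothing here is the real SYSCON firmware or a real SPI API.

```python
def resolve_livelock(read_bit, write_bits, notify_os, assert_checkstop):
    """Follow the IBM C.1 sequence once.

    read_bit(n)         -> value of rd_spi_status[n]        (hypothetical)
    write_bits(bits, v) -> set wr_spi_status[bits] to v     (hypothetical)
    """
    # Livelock is only indicated when rd_spi_status[0] and [7] are both set.
    if not (read_bit(0) and read_bit(7)):
        return "no_livelock"

    write_bits([18], 1)          # 1. throttle the PPE
    write_bits([16, 17, 19], 1)  # 2. quiesce MFC/PPE, enable resolution mode
    write_bits([18], 0)          # 3. stop throttling the PPE
    write_bits([4], 1)           # 4. reset/resample (deactivates ATTENTION)

    if read_bit(7) == 0:         # 5. '0' means the livelock is resolved
        write_bits([16, 17, 19], 0)  # 6a. leave resolution mode
        notify_os()                  # 6b. tell the operating system
        return "resolved"

    assert_checkstop()           # 7. unresolved: assert CHECKSTOP_IN
    return "checkstop"


class FakeCell:
    """Toy stand-in for the Cell BE status registers, for dry runs only."""

    def __init__(self, resolvable):
        self.rd = {0: 1, 7: 1}   # pretend a livelock was just detected
        self.wr = {}
        self.resolvable = resolvable

    def read_bit(self, bit):
        return self.rd.get(bit, 0)

    def write_bits(self, bits, value):
        for bit in bits:
            self.wr[bit] = value
        # Pretend the resample clears bit 7 if resolution mode was enabled.
        if 4 in bits and value == 1 and self.resolvable and self.wr.get(19) == 1:
            self.rd[7] = 0


if __name__ == "__main__":
    cell = FakeCell(resolvable=False)
    print(resolve_livelock(cell.read_bit, cell.write_bits,
                           lambda: print("OS notified"),
                           lambda: print("CHECKSTOP_IN asserted")))
```

The point of writing it out is that the only exits are "resolved" or a checkstop, which fits the idea that an unresolved ATTENTION ends with the console being shut down.
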
I'm digging in now, and it basically describes the entire Power On Reset and Firmware execution stages to initialize the CELL BE! Moreover, it explains the Checkstops and Attention signals.

Pure effing gold!
 
Yes, this is still that VER-001.
https://www.psx-place.com/threads/f...nd-error-reporting.30100/page-114#post-319670

BTW. I fixed SEM-001 by a reflow and it is working fine.

Regarding VER-001.

I checked all hdmi filters - and they passed the continuity mode. They are not shorted.
The hdmi port was tested with continuity mode on the Panasonic IC legs (all working except 1 pin - a spare one - left for hdmi standard compatibility future purposes).
I reflowed the Panasonic IC.

Console works on Component just fine.
Now, I tested the HDMI output on another TV, aaand... it showed a black screen and a message about "unsupported resolution".
I restarted the console into the service menu, but still... hdmi not working.

The smd's around panasonic - all look fine...
Again... this is a new Panasonic IC...



Regarding the ohm test, just to make sure, you want me to measure locations from this picture?
cok-001_mb_ohm_test_points-jpg.35900
 
Yes, those are all the relevant RSX voltage lines we were measuring off the board. Only now, we will be measuring them on the board to see if the ohm tests match up to what's expected. But that presents an issue. What's expected? Now that it's soldered on the board, those measurements will read differently, due to parasitic inductance and resistance in the entire line.

This is why we need a working VER-001 to compare with. And other MB revisions, for that matter.
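
Once a working control board has been measured, the comparison itself is simple. Here is a small Python sketch of one way to do it, assuming readings are written down as test point name -> ohms. The function name, the 10% tolerance and the numbers in the usage example are placeholders I made up, not real measurements from any board.

```python
def compare_to_control(control_ohms, suspect_ohms, tolerance=0.10):
    """Compare in-circuit ohm readings of a suspect board against a control.

    tolerance is the allowed relative difference (0.10 = 10%), an arbitrary
    starting value; a sensible figure depends on the meter and the rail.
    """
    report = {}
    for point, expected in control_ohms.items():
        measured = suspect_ohms.get(point)
        if measured is None:
            report[point] = "missing"      # not measured on the suspect board
        elif expected == 0:
            report[point] = "ok" if measured == 0 else "deviates"
        elif abs(measured - expected) / expected <= tolerance:
            report[point] = "ok"
        else:
            report[point] = "deviates"
    return report


if __name__ == "__main__":
    # Placeholder values only, NOT real readings.
    control = {"RSX_VDDA": 100.0, "+1.5V_RSX_VDDIO": 50.0}
    suspect = {"RSX_VDDA": 104.0, "+1.5V_RSX_VDDIO": 2.0}
    print(compare_to_control(control, suspect))
```

Anything flagged "deviates" is only a pointer to where to probe next, since readings also shift between motherboard revisions.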

BTW, I'm going to revise that picture to include some more of the CPU voltages. Thanks to @marciolsf's suggestion that I look at IBM's Hardware Initialization Guide for the 65nm Cell, I have learned a TON more about the CPU, and incidentally about the RSX. It has to be something that affects the HDMI transmitter, but doesn't affect the Power On Reset or FlexIO Calibration (BitTraining). So that should rule out major voltages and solderballs/bumps.

IDK how useful this is for a VER-001, but for a COK-001 the following are required for the HDMI transmitter to work...
  • +1.5V_RSX_VDDIO, +5V_ANA, +3.3V_ANA, and +1.8V_ANA are making it to the chip.
  • +5V_ANA is making it to the level shifter (IC2501), and it is receiving and shifting 1.5V <--> 3.3V for the following signals:
    • RS_SPD0 --[1.5V]--> IC2501 --[3.3V]--> RS_SPD0_33
    • /HDMI_INI --[3.3V]--> I assume SYSCON.
      • HDMI is powered and ready to be configured.
  • TH2501 is not blown.
  • None of the Zener diodes are shorted.
  • None of the varistors are shorted.
  • None of the caps are shorted.
  • Induction filters are not shorted diagonally across themselves.
  • Q2502, Q2504, and Q2507 are not shorted.
  • R2520 & R2523 are not shorted; otherwise the SYSCON cannot communicate with the HDMI transmitter over I2C.
  • FB2501, FB2503, & FB2504 are not blown.
  • R2502 isn't blown; otherwise HDMI reset won't work.
Well, you get the idea...
 
