I corrected my old code to use the initialization you suggested (as well as the dummy byte, which according to the specs is necessary before the clock is stopped) from:
https://github.com/Krasutski/sdcard_spi_driver/blob/master/spi_sdcard_driver.c
In practice, the cards work even if the clock is paused without a dummy byte being sent first, but it is safer to have it, so I included it. It should also ensure that the next command executes correctly.
I looked at my sector transfer code... making it use DMA optimally will be a challenge.
We don't consider a read of a single sector, as that is slow.
A read of multiple sectors consists of:
- A: CMD18 with R1 response
- B: At this point the card fetches the data from flash and prepares it for sending. The card returns 0xFF during this time. Once done, it sends a readDataToken instead of 0xFF.
- C: The card sends 0x200 bytes - one sector, followed by two CRC bytes.
- Then repeats from step B for the next sector.
- D: CMD12 is sent, which terminates the transfer.
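To make steps A to D concrete, here is a minimal host-side sketch, not the driver's actual code. SpiXfer, sendCmdR1(), readSectors() and mockXfer() are hypothetical names invented for illustration; the 0xFE token value, the dummy CRC byte and the stuff byte after CMD12 follow the usual SD SPI-mode conventions, and the argument is assumed to be a block address. The mock simply replays a fixed two-sector response stream, so no card is needed to run it.

```c
#include <stdint.h>
#include <stddef.h>

#define SECTOR_SIZE 0x200
#define DATA_TOKEN  0xFE  /* start-block token for reads in SPI mode */

/* Hypothetical hook: clocks one byte out and returns the byte clocked in. */
typedef uint8_t (*SpiXfer)(uint8_t out);

/* Sends a 6-byte command frame and polls up to 8 bytes for the R1 response. */
static int sendCmdR1(SpiXfer xfer, uint8_t cmd, uint32_t arg)
{
    const uint8_t frame[6] = { (uint8_t)(0x40 | cmd), (uint8_t)(arg >> 24),
                               (uint8_t)(arg >> 16), (uint8_t)(arg >> 8),
                               (uint8_t)arg, 0x01 /* dummy CRC + stop bit */ };
    for (size_t i = 0; i < sizeof frame; i++)
        (void)xfer(frame[i]);
    if (cmd == 12)
        (void)xfer(0xFF);               /* CMD12 is followed by one stuff byte */
    for (int i = 0; i < 8; i++) {
        uint8_t r = xfer(0xFF);
        if (!(r & 0x80))
            return r;                   /* R1 always has its top bit clear */
    }
    return -1;                          /* no response within the timeout */
}

/* Reads count sectors into dst; returns 0 on success, -1 on error. */
static int readSectors(SpiXfer xfer, uint32_t lba, uint8_t *dst, unsigned count)
{
    int rc = 0;
    if (sendCmdR1(xfer, 18, lba) != 0)                  /* A: CMD18 + R1 */
        return -1;
    for (unsigned s = 0; s < count; s++) {
        uint8_t b = 0xFF;
        for (int i = 0; i < 0x800 && b == 0xFF; i++)
            b = xfer(0xFF);                             /* B: wait for token */
        if (b != DATA_TOKEN) { rc = -1; break; }
        for (size_t i = 0; i < SECTOR_SIZE; i++)
            *dst++ = xfer(0xFF);                        /* C: 0x200 data bytes */
        (void)xfer(0xFF);                               /* C: two CRC bytes, */
        (void)xfer(0xFF);                               /*    ignored here   */
    }
    (void)sendCmdR1(xfer, 12, 0);                       /* D: stop transfer */
    return rc;
}

/* Mock card for testing on a PC: replays a fixed 2-sector response stream. */
static size_t mockPos;
static uint8_t mockXfer(uint8_t out)
{
    const size_t sect = 3 + 1 + SECTOR_SIZE + 2; /* wait, token, data, CRC */
    size_t p = mockPos++;
    (void)out;
    if (p < 7)  return 0xFF;            /* CMD18 frame + one delay byte */
    if (p == 7) return 0x00;            /* R1: no error */
    p -= 8;
    if (p < 2 * sect) {
        size_t q = p % sect;
        if (q < 3)  return 0xFF;        /* card still fetching */
        if (q == 3) return DATA_TOKEN;
        if (q < 4 + SECTOR_SIZE) return (uint8_t)(p / sect + 1); /* payload */
        return 0xAA;                    /* CRC placeholder */
    }
    p -= 2 * sect;
    return p < 7 ? 0xFF : 0x00;         /* CMD12 frame + stuff byte, then R1 */
}
```

Calling readSectors(mockXfer, 0, buf, 2) with mockPos reset to 0 fills the first sector with 1s and the second with 2s. The point of the sketch is only to show where the polling (B) sits between the fixed-size parts (C), which is what makes a single long DMA transfer awkward.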
Point A:
Sending a command is slow in general, because of the time needed to set up the hardware and also because of the way we wait for a response: a response can come at any point within a fixed timeout period (i.e. number of bytes read) after the command's last byte has been sent. This means that we don't know in advance how many bytes we have to transfer to get the response. So there are two methods:
- sendSdCmd() Transfer a single byte at a time, checking each one to see whether it is the response byte. This is slow, because many transfers are run: one per byte, and there are 6 command bytes and commonly up to 8 bytes of delay before the 5 response bytes.
- sendSdCmdB() Transfer the maximum possible packet size = 6 + CMD_WAIT_RESP_TIMEOUT + 5. But according to the driver by Krasutski, CMD_WAIT_RESP_TIMEOUT is 100, which is a lot: the resulting 111-byte packet is only about 4.6 times smaller than a whole sector.
The first method could be considered unsafe, because the clock gets paused between every two bytes. However, it seems to work fine. So in theory, further optimizing the first method for speed would be the best option.
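For the second method, the response has to be located inside the received buffer after the transfer has finished. A minimal sketch of that scan is below; findR1() and CMD_LEN are hypothetical names, and it assumes the idle line reads 0xFF and that the first byte of the response (R1) has bit 7 clear, as in SD SPI mode.

```c
#include <stdint.h>
#include <stddef.h>

#define CMD_LEN 6  /* command frame length in bytes */

/* Returns the offset of the R1 byte in rx, or -1 if none arrived in time.
   rx holds everything clocked in during one bulk transfer, including the
   CMD_LEN bytes received while the command itself was being sent. */
static int findR1(const uint8_t *rx, size_t len)
{
    for (size_t i = CMD_LEN; i < len; i++)
        if (!(rx[i] & 0x80))   /* idle line reads 0xFF; R1 has bit 7 clear */
            return (int)i;
    return -1;
}
```

Any remaining response bytes then follow directly after the returned offset, so the caller has to make sure the buffer is long enough to hold them.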
One way of doing that would be to not do the initialization in sendCmd() each time. But we want to separate the SD Card code from the SIO2 code. So the solution would be to have a function that can re-run a transfer, with minimal code (and thus be faster).
Point B:
This, and the CRC bytes, are part of the reason why DMA is suboptimal. At the start of period B, a transfer has to be set up which reads a single byte and checks whether it is the readDataToken. One might suggest moving the DMA and SIO2 initialization code for the sector transfer before this, which would make sense, but it can't be done, because we are using SIO2 here as well.
Another idea would be to make this a separate queue element and place the sector transfer and CRC after it on the queue. This won't work because AFAIK, we have no way of re-running a single queue element... or maybe there is - through placing a null-element after it... but this again becomes very involved in the SIO2.
So in short - we can't just use DMA to transfer multiple sectors at once, because we have to wait for each one and poll the card to check when it is ready to send it.
We can't sleep the thread while waiting for the card, because we have to actively check the card for its state.
The only thing that can be done is probably releasing some cycles to other threads, but this can make the SD Card transfer too slow.
Point C:
Again requires setting-up SIO2, which is slow. We can't even do the whole sector transfer at once, because a single queue element can transfer at most 0x100 bytes - half of the sector size.
DMA would make getting the data to memory faster. But keep in mind that a single memory access under program control and a DMA access to the same memory take the same amount of time, apart from the program execution overhead. Also, with the PIO code we save one write to RAM and one read back, because the bits can be reversed before the data is stored. With DMA, the bit-order reversal has to happen later: in practice, the DMA transfer of the next sector would first have to be started, and then the code that reverses the bits of the previous sector would run. Add the waiting for the card to fetch the sector data and the CRC bytes, and the code becomes quite complicated.
I tested without the command sending code and card polling - i.e. with only the raw PIO sector + CRC transfer code and the results are:
Code:
MX4SIO: sz 00000200 1699 kB/s
MX4SIO: sz 00000400 1787 kB/s
MX4SIO: sz 00000600 1799 kB/s
MX4SIO: sz 00000800 1762 kB/s
MX4SIO: sz 00000A00 1802 kB/s
MX4SIO: sz 00000C00 1805 kB/s
...
MX4SIO: sz 00004000 1810 kB/s
MX4SIO: sz 00004200 1808 kB/s
MX4SIO: sz 00004400 1810 kB/s
MX4SIO: sz 00004600 1811 kB/s
So 1800kB/s is the maximum with the current PIO code (no SD Card needs to be connected for the test). Maybe I'll implement a function of the driver to be able to do such a test, so we can compare the performance of the different PS2s.
With a bit more optimization, regardless of whether bit-reversing is done or not, I get: MX4SIO: sz 00001800 1816 kB/s. The code is extremely short now, so there is no way to get better speed.
@Maximus32 How do you propose we let other IOP drivers use the free time while this one is neither doing DMA nor reversing the bits?
Maybe by sleeping the current thread and having the DMA interrupt or transfer completion wake it up?
BTW, if we register a DMA interrupt, that will mean that SIO2MAN can't have its own registered (which it doesn't anyway, so I guess this is OK).
EDIT 3: Actually, if it can be done with only a few DelayThread() calls at key places, it would be better, and wouldn't require dealing with interrupts, etc. What do you think about that?
Checking the code, one of the other issues with DMA is the trigger - the transferring of a DMA block from SIO2 FIFO to RAM is done at the end of each queue element. So if we use two elements, 0x100 bytes each, this is what will happen:
- SIO2 receives 0x100 bytes from SD Card - this takes a fixed, long period of time.
- DMA is triggered by the completion of the queue element and transfers the 0x100 bytes to RAM.
- The above is repeated.
So this way, because the DMA transfer starts when half the sector has been received, we get a further delay of the duration of the DMA transfer of 0x100 bytes in addition to the SIO2 SPI transfer duration. One way around this would be to cut the transfer into more blocks and, respectively, more queue elements.
Surprisingly, actual tests show that cutting the transfer into blocks of less than 0x100 bytes makes it slower. Maybe this is because interrupting the SIO2 transfer slows it down (tests showed that accessing some SIO2 registers, or maybe the FIFO, too often slows the SIO2 down), and because this way the card has less time to fetch the data, as the (second 0x100-byte) DMA transfer takes place basically around the time the card knows to fetch the next sector.
It seems I can't get the DMA transfer rates as high as those of PIO mode. Still DMA mode will be necessary.
Some speed comparisons:
https://docs.google.com/spreadsheets/d/1ZVjVafwJi0pioURmA-jm4bolBWTT1notffAecYBVcL4/edit?usp=sharing
D,BR = DMA + bit-reversing (the DMA-only test does not include bit-reversing)
EDIT 1:
@Maximus32 You did once test the USB Flash raw block-device transfer speeds, right? We can compare them with the SD Card ones, to see what kind of improvement we can expect from this driver.
EDIT 2:
Measuring the parts of the transfer. After adding the time-recording code, the speeds dropped as follows:
New card, DMA, no CRC, no bit-reversal except for the last sector:
from: MX4SIO: sz 00004600 1733 kB/s MX4SIO: sz 00000400 1352 kB/s
to:
MX4SIO: sz 00000200 305 kB/s
MX4SIO: sz 00000400 1111 kB/s
MX4SIO: sz 00000600 1249 kB/s
MX4SIO: sz 00000800 1333 kB/s
MX4SIO: sz 00000A00 1386 kB/s
MX4SIO: sz 00000C00 1427 kB/s
...
MX4SIO: sz 00003800 1595 kB/s
...
MX4SIO: sz 00004600 1607 kB/s
New card, DMA, no CRC, no bit-reversal except for the last sector (8-sector transfer):
Code:
0000: loopBgn 5 sec 361369 usec waitData 1005 dmaSetup 9 sioAndDmaRun{ bitRev 5 freeTime 269 } crcRead 9 lastSectBR 7 toNextIter 4
0001: loopBgn 5 sec 362677 usec waitData 11 dmaSetup 6 sioAndDmaRun{ bitRev 3 freeTime 269 } crcRead 7 lastSectBR 4 toNextIter 3
0002: loopBgn 5 sec 362980 usec waitData 10 dmaSetup 6 sioAndDmaRun{ bitRev 4 freeTime 269 } crcRead 7 lastSectBR 4 toNextIter 3
0003: loopBgn 5 sec 363283 usec waitData 10 dmaSetup 6 sioAndDmaRun{ bitRev 4 freeTime 269 } crcRead 7 lastSectBR 4 toNextIter 3
0004: loopBgn 5 sec 363586 usec waitData 10 dmaSetup 6 sioAndDmaRun{ bitRev 4 freeTime 269 } crcRead 6 lastSectBR 4 toNextIter 4
0005: loopBgn 5 sec 363889 usec waitData 10 dmaSetup 5 sioAndDmaRun{ bitRev 4 freeTime 269 } crcRead 7 lastSectBR 4 toNextIter 3
0006: loopBgn 5 sec 364191 usec waitData 10 dmaSetup 6 sioAndDmaRun{ bitRev 3 freeTime 269 } crcRead 7 lastSectBR 4 toNextIter 3
0007: loopBgn 5 sec 364493 usec waitData 10 dmaSetup 6 sioAndDmaRun{ bitRev 4 freeTime 269 } crcRead 6 lastSectBR 113
One sector (0x200 bytes) transfer (above) takes ~303us => 1650kB/s, which excludes the initial command sending overhead and the initial waiting for data ("waitData 1005" at the first entry above).
(This is perhaps also why I initially disregarded DMA as a usable mode, as the PIO raw transfer speeds (with bit-reversal) are: sz 00004600 1811 kB/s.)
loopBgn - Time of (sectors) loop beginning. All other times are in usec.
waitData - Waiting for the card to prepare the data for sending.
dmaSetup - The duration of the code that sets up DMA and the SIO sector transfer.
sioAndDmaRun {bitRev} - Time of the bit-reversal while sector data is being transferred to SIO2 FIFO. (No bit-rev for the above test.)
sioAndDmaRun {freeTime} - Time from the end of the bit-reversal until the SIO transfer and DMA complete, during which the IOP is free.
crcRead - Duration while CRC is being read.
lastSectBR - Duration of the bit-reversal of the last sector (only done for the last sector).
toNextIter - Duration to the next loop iteration.
The durations of the initial command sending and the stop-command are not measured.
New card, DMA, no CRC, with BitRev:
Code:
MX4SIO: 0000: loopBgn 5 sec 254453 usec waitData 1004 dmaSetup 9 sioAndDmaRun{ bitRev 4 freeTime 269 } crcRead 10 lastSectBR 6 toNextIter 5
MX4SIO: 0001: loopBgn 5 sec 255760 usec waitData 12 dmaSetup 6 sioAndDmaRun{ bitRev 121 freeTime 152 } crcRead 6 lastSectBR 5 toNextIter 4
MX4SIO: 0002: loopBgn 5 sec 256066 usec waitData 10 dmaSetup 5 sioAndDmaRun{ bitRev 119 freeTime 154 } crcRead 6 lastSectBR 4 toNextIter 5
MX4SIO: 0003: loopBgn 5 sec 256369 usec waitData 10 dmaSetup 4 sioAndDmaRun{ bitRev 119 freeTime 154 } crcRead 6 lastSectBR 5 toNextIter 4
MX4SIO: 0004: loopBgn 5 sec 256671 usec waitData 10 dmaSetup 5 sioAndDmaRun{ bitRev 119 freeTime 154 } crcRead 6 lastSectBR 4 toNextIter 4
MX4SIO: 0005: loopBgn 5 sec 256973 usec waitData 10 dmaSetup 5 sioAndDmaRun{ bitRev 119 freeTime 154 } crcRead 6 lastSectBR 4 toNextIter 5
MX4SIO: 0006: loopBgn 5 sec 257276 usec waitData 10 dmaSetup 5 sioAndDmaRun{ bitRev 119 freeTime 154 } crcRead 6 lastSectBR 4 toNextIter 4
MX4SIO: 0007: loopBgn 5 sec 257578 usec waitData 11 dmaSetup 5 sioAndDmaRun{ bitRev 119 freeTime 153 } crcRead 34 lastSectBR 113
It can be seen that, because the total time of the SIO transfer + DMA transfer is fixed, the sum of sioAndDmaRun: bitRev + freeTime is always the same = ~273.
The bit-reversal appears to take about half the available time, however it still makes the transfer take longer, which may be due to competing with the DMAC for access to the RAM, or something else.
Also the "free time" - sioAndDmaRun, is not uniform - it would start with a SIO transfer from the SPI to FIFO, then once 0x100 bytes have been transferred, the transfer of the first DMA block will begin, which would pause program execution and resume it after the block transfer ends. Meanwhile, it is assumed that the second queue entry transfer (next 0x100 bytes) begins without waiting for DMA to complete. However, it is possible that, in order to prevent buffer overflow, the transfer is paused until the buffer has been emptied enough by DMA.
All DMA tests use 32-bit bit-reversal code (reversing the bits within each byte), with the mask variables assigned in advance. To be able to use a different optimization setting than the rest of the code, it can be moved to its own object file and have a different -O setting in the Makefile.
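For reference, a common way to write such a 32-bit, mask-based reversal is sketched below. This is the generic swap-nibbles/pairs/bits technique, not necessarily the exact code used in the tests; bitRevBytes32() and bitRevBuffer() are hypothetical names.

```c
#include <stdint.h>
#include <stddef.h>

/* Reverses the bit order inside each of the four bytes of w.
   The byte positions within the word are left unchanged. */
static inline uint32_t bitRevBytes32(uint32_t w)
{
    const uint32_t m4 = 0x0F0F0F0Fu, m2 = 0x33333333u, m1 = 0x55555555u;
    w = ((w >> 4) & m4) | ((w & m4) << 4);  /* swap nibbles */
    w = ((w >> 2) & m2) | ((w & m2) << 2);  /* swap bit pairs */
    w = ((w >> 1) & m1) | ((w & m1) << 1);  /* swap adjacent bits */
    return w;
}

/* Reverses a whole buffer in place; a 0x200-byte sector is 0x80 words. */
static void bitRevBuffer(uint32_t *buf, size_t words)
{
    for (size_t i = 0; i < words; i++)
        buf[i] = bitRevBytes32(buf[i]);
}
```

Keeping this in a separate translation unit is what would allow it to be built with its own -O level, as described above.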
Unlike previous tests (without the time-recording code), now the duration of a sector transfer is 303us both with and without bit-reversing.
Adding DelayThread(1) after the bit-reversing, under sioAndDmaRun, makes the speed drop from sz 00004600 1606 kB/s to sz 00004600 1473 kB/s (bit-reversing disabled). One way to reduce the negative effect of this would be to do it only once every several sector transfers.
A good way to do something useful while waiting for slow cards to fetch the data to be read, is placing a DelayThread(1); in the code below.
Fast cards usually end the loop before 4 iterations have passed, so there is no slowdown observed for them.
Code:
i = 0;
rdPollBytePrepare(port);
do {
    dataToken = rdPollByteWait(0xFF, port);
    i++;
    if (i >= 5) DelayThread(1); // For slow cards, we can let other threads run for a bit.
    if (i > 0x800) { mprtf("\n ERR: Timeout on waiting for data (token). "); break; }
} while (dataToken == 0xFF);
rdPollByteComplete(port);
EDIT 4: Bit-reversing: 0x200 bytes in 120us is 4170 kB/s, close to the 4500kB/s you got on a FAT PS2. Using your function gives about the same speed. Using -O3 with your function takes 82us (and there is a bit more code around it, which may be why I am not reaching the 73us for 6765kB/s you achieved). However, using -O3 for the whole IRX drops the total speed from sz 00004600 1605 kB/s to 1583 kB/s (so maybe the objects are better kept separate). With -O2 the bit-reversal takes 80us and the speed is 1600 kB/s. Or I can copy the optimized asm from -O3 to an inline func.
EDIT 5: The PIO functions can also have DelayThread added to the code that waits for the card to get the data or to write it to flash.
EDIT 6: I don't remember why I thought that the driver wasn't working on PPC-IOP models. I tested the new driver and it works fine, just slow - 1320 kB/s is the max PIO speed for the fastest card I have, while 1180 kB/s is the fastest DMA speed. I think the reason is that MC slot 2 (port 3) is used, which has a bit-rate override by DECKARD. Which is why last time I was testing with PPC-IOP patching code, trying to remove the override, and maybe that is where some incompatibility happened. In theory, testing at MC slot 1 should solve the problem.
One other thing I changed, was that now -O2 is always enabled, so maybe this changed the compatibility, but I doubt it.
EDIT 7: Tests on SCPH-79000 PPC-IOP.
DECKARD has several overrides for SIO2 settings made from the IOP, including ones that write to a register (0xBF80825C) when other registers are written.
The maximum transfer clock is capped at 24MHz (=48MHz/2) and values 0 and 1 result in the same clock, unlike on SCPH-30000, where they result in malformed data and odd logic reactions. Lower clock settings work as expected (value 3 -> 48/3=16MHz).
The minimum inter-byte duration appears to be 352ns, which is a LOT, and most likely the reason why transfer speeds on the PPC-IOP are slower. It is 176ns on the MIPS-IOP. This is probably because 0xBF80825C bits 23:16 are 0 on the MIPS IOP, while DECKARD forces them to 3, and according to my notes:
Code:
Total number of SCK-cycles between bytes = 2 + max(pCtrl1.23:16, 2); So the minimal period is 2+2=4 [SCK-cycles]. Specifying values 0,1 or 2 has the same effect - 4 [SCK-cycles], while a value of 3 results in 5 [SCK-cycles]. This effect is present regardless the divisor value (tested only with values 4-6).
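The rule from these notes can be written as a one-line helper. This is only a restatement of the note above as code; interByteSckCycles() is a hypothetical name and the argument is the value of bits 23:16 of 0xBF80825C.

```c
/* SCK cycles between bytes for a given value of pCtrl1 bits 23:16,
   following the rule: 2 + max(field, 2). */
static unsigned interByteSckCycles(unsigned field)
{
    return 2 + (field > 2 ? field : 2);
}
```

So values 0, 1 and 2 all give 4 cycles, and 3 gives 5 cycles.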
In theory, because the PPC-IOP runs on a higher frequency than the MIPS-IOP, the SIO2 should be able to run at higher frequency as well, but this does not seem to be the case (although the SIO2 uses the 48MHz clock, so perhaps that is why). It might be that this additional delay was necessary due to synchronization of the SIO2 FIFO with the faster PPC-IOP clock.
EDIT 8:
The PPC-IOP SIO2 has an additional inter-byte delay of 244ns, and the smallest number of cycles that can be added to that is 3, which results in the ~340ns inter-byte period measured above.
clockSpeedDiv = 2 = 24MHz interBytePeriod = 0xBF80825C.23:16 :
9 = 588 ns
8 = 548 ns
7 = 508 ns
6 = 464 ns
5 = 424 ns
4 = 384 ns
3 = 340 ns minimum
2,1,0 = 340 ns
period = 244 + clkCycleDuration * (interBytePeriod[cy] -0.5)
0xBF80825C.23:16 = 4 -> 3.5cy:
244 + 3.5cy * 40ns(/2 24MHz) = 244 + 140 = 384ns
244 + 3.5cy * 63ns(/3 16MHz) = 244 + 220 = 464ns
244 + 3.5cy * 84ns(/4 12MHz) = 244 + 294 = 538ns
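The fitted formula can be checked with a small helper. This is a sketch of the formula above only; ppcInterBytePeriodNs() is a hypothetical name, the SCK period is passed in ns using the same rounded values as in the examples (40ns for 24MHz), and the clamp to 3 reflects the observation that values 0 to 2 behave like 3. Note that it gives 344ns for the minimum, close to but not exactly the measured ~340ns.

```c
/* Estimated PPC-IOP inter-byte period in ns:
   244 + sckNs * (cycles - 0.5), with cycles never below 3. */
static double ppcInterBytePeriodNs(unsigned field, double sckNs)
{
    unsigned cy = field < 3 ? 3 : field;   /* values 0..2 behave like 3 */
    return 244.0 + sckNs * ((double)cy - 0.5);
}
```

For example, ppcInterBytePeriodNs(4, 40.0) gives 384 and ppcInterBytePeriodNs(4, 84.0) gives 538, matching the worked lines above.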
I removed the DECKARD code that overrides register values, and it appears that any lower values still result in the durations corresponding to the overridden values. So there is nothing to gain from patching DECKARD, and the fastest mode has an inter-byte delay of 352 ns (as opposed to 176 ns on a MIPS-IOP). Maybe this is due to additional synchronization between the SIO2 FIFO and the SIO2 SPI shift register and/or the new PPC-IOP core, running at higher speed, while the SIO2 is clocked by the 48MHz clock.
This means 40ns(24MHz) * 8bits = 320ns; 320 + 352 = 672ns per byte, which means 1453kB/s is the maximum bandwidth of the SIO2 on a PPC-IOP, while on the MIPS-IOP it is 1969kB/s.
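The same arithmetic as a helper, for trying other clock and delay combinations. sio2PeakKBps() is a hypothetical name, and it assumes 1 kB = 1024 bytes, which is the convention that reproduces the two figures above.

```c
/* Peak SIO2 throughput in kB/s (1 kB = 1024 bytes), rounded to nearest.
   sckNs is the SCK period, interByteNs the gap between bytes. */
static unsigned sio2PeakKBps(double sckNs, double interByteNs)
{
    double byteNs = 8.0 * sckNs + interByteNs;   /* 8 data bits + gap */
    double bytesPerSec = 1e9 / byteNs;
    return (unsigned)(bytesPerSec / 1024.0 + 0.5);
}
```

sio2PeakKBps(40.0, 352.0) gives 1453 (PPC-IOP) and sio2PeakKBps(40.0, 176.0) gives 1969 (MIPS-IOP).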
(Of course DECKARD patching can still be of use for projects that emulate another device using an SD Card.)
EDIT 9: "Speed testing storage devices" thread from psx-scene:
https://ia802907.us.archive.org/11/items/psx-scene-processed-archive/part0293.html#T157395P1
For USB and other devices' speeds.
EDIT 10: I don't know how, but now I am getting speeds closer (basically the same) for DMA as for PIO. Maybe I had forgotten to remove DelayThread from the polling code or maybe it is some timing parameters I changed.
Note that the speed tests in EDIT 6 are for the fastest card. For the slowest card, the DMA speed on MIPS-IOP is ~ 1100kB/s.
Above only read speeds are measured. Writes to flash generally take more time, which may not matter much though, when more than one sector is written.
EDIT 11:
@Maximus32 How do I handle the SD Card detection (insertion)? For removal, I'd have to make the driver check for the card's presence on each operation and, if that fails, unregister the block device. But how to detect insertion? It would have to be a thread, with an infinite loop and a DelayThread(5000000) inside, but should this be in the driver, or should the BDM or the user driver call that to check for new cards?