Dreamboard

noggie · 12 February 2006

Quote

Originally posted by dnuos
I already flashed again through the web interface and a LAN cable... and the 39 bad blocks are always there...
Any idea if/how I could solve this?

Wow! And I'm really sorry!! Your flash developing 39 bad blocks all of a sudden could of course be just a coincidence, but I find that very hard to believe. It sounds more like a problem caused by the failed restore of the backup. Maybe the OOB data is wrong somehow and is being interpreted as bad blocks. The only good news in this sorry story is that your flash is probably all OK, so if a way to just reset everything can be found, you should be back on track.
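To illustrate how wrong OOB data can end up being read as bad blocks, here is a minimal sketch. It assumes a small-page NAND chip with 16 spare (OOB) bytes per 512-byte page and the bad-block marker in spare byte 5, which is the convention Linux's NAND code uses for small-page devices. Whether the 7020's flash and the secondstage loader follow exactly this convention is an assumption; the function name is made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumptions (not confirmed in this thread): 16 OOB bytes per page,
 * bad-block marker stored in OOB byte 5 of a small-page NAND device. */
enum { OOB_SIZE = 16, BADBLOCK_MARKER_POS = 5 };

/* Returns 1 if this spare area would be read as "block is bad", else 0.
 * Any value other than 0xFF in the marker byte counts as bad. */
int oob_marks_block_bad(const uint8_t *oob, size_t len)
{
    assert(oob != NULL && len > (size_t)BADBLOCK_MARKER_POS);
    return oob[BADBLOCK_MARKER_POS] != 0xFF;
}
```

Under these assumptions, a restore that writes anything but 0xFF into that one byte is enough to make a perfectly healthy block look bad.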

I'm thinking that maybe flashing an empty image that uses all the flash could help, but I don't know of any such image. Does anybody else know of such a beast?

If you can't find a tested version of such an image, I've made my own attempt at making one. Completely untested, of course. It contains version 35 of the second stage loader, and then lots and lots of 0xFF. It's made using the standard "buildimage" utility, so the oob data should be computed correctly. Up to you, of course, if you want to try it.

One final thing. I notice that the image you have restored has a completely different partition table. My memory is not reliable, but I seem to recall that the layout was changed (even on the 7020) when work started internally in DMM preparing the 7025 image, since gcc produces less dense code for MIPS processors than it does for the powerpc. It's just a shot in the dark, but maybe you could try different images. Different versions of the second stage loader may make a difference, although DMM alone knows what changes they have made lately to that code. In the newest dreambox-buildimage, a comment says

Code

// pre-35 have old layout

, but the latest available sources (that I'm aware of) are for version 28.

EDIT: Just saw wolpi's contribution (thanks wolpi!), and I have also been reading through the old secondstage sources. Writing a blank image probably won't help, since the flashing code won't reset the badblock flag. Back to the drawing board to try to come up with a solution. The good news is still that your flash is probably OK; we just need to find a way to clear that badblock flag.

tmbinc · 13 February 2006

Wow, great.

You have all now officially found the reason why we never made a backup tool, and/or let the 2ndstage loader produce .nfi files with that "backup" option. It just doesn't work without optimizing the jffs2 filesystem afterwards, so that a number of blocks stay free in case there are bad blocks. If the jffs2 isn't filled up, you can just skip blocks, but if it is, then you need to do a garbage collection. And that's stuff you really don't want to do inside a second stage loader, because it's basically what mkfs.jffs2 does.
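The "just skip blocks" case can be sketched as a simple mapping from image blocks to good physical blocks. This is only an illustration of the idea, not the secondstage loader's actual code; the function name and the `bad[]` array representation are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Map each image block to the next good physical block.
 * bad[i] != 0 means physical block i is flagged bad.
 * Returns 0 on success, -1 if the partition runs out of good blocks. */
int map_skipping_bad(const unsigned char *bad, size_t phys_blocks,
                     size_t image_blocks, size_t *phys_for_image)
{
    assert(bad != NULL && phys_for_image != NULL);
    size_t phys = 0;
    for (size_t img = 0; img < image_blocks; img++) {
        /* Step over every physical block that is flagged bad. */
        while (phys < phys_blocks && bad[phys])
            phys++;
        if (phys == phys_blocks)
            return -1; /* image is too full to absorb the bad blocks */
        phys_for_image[img] = phys++;
    }
    return 0;
}
```

The `-1` branch is the situation described above: once the image needs every block of the partition, skipping is no longer possible and the filesystem would have to be compacted first.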

Anyway,

if you screwed up your badblock information (anything more than ~5 bad blocks is definitely not normal, though I don't know the exact numbers guaranteed by Samsung etc.), you've lost some information which can't be retrieved. The badblock information is stored at production, and there is no test to reveal if a specific block is bad or not.

One method is of course to erase (and rewrite) a block and check whether the erase failed. If it failed, it's a bad block for sure (and you just did an illegal operation, as the specification says that you should NEVER attempt to erase a bad block!). If it succeeded, well, then it's undefined. It could be a bad block, but it's probably not.
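The asymmetry of that test can be written down as a tiny sketch: a failed erase proves the block is bad, a successful erase proves nothing. The callback type and the simulated flash below are hypothetical and exist only so the logic can be run without touching real hardware.

```c
#include <assert.h>
#include <stddef.h>

enum block_verdict { BLOCK_BAD_FOR_SURE, BLOCK_UNDEFINED };

/* Hypothetical callback: erase one block, return 0 on success. */
typedef int (*erase_fn)(void *ctx, unsigned block);

/* A failed erase is conclusive; a successful one is not. */
enum block_verdict probe_by_erase(erase_fn erase, void *ctx, unsigned block)
{
    assert(erase != NULL);
    if (erase(ctx, block) != 0)
        return BLOCK_BAD_FOR_SURE;
    return BLOCK_UNDEFINED; /* probably good, but not guaranteed */
}

/* Simulated flash for demonstration only: erase fails on one block. */
struct fake_flash { unsigned failing_block; };

int fake_erase(void *ctx, unsigned block)
{
    const struct fake_flash *f = ctx;
    return block == f->failing_block ? -1 : 0;
}
```

Note that there is deliberately no "good for sure" verdict, which is why the factory-written bad block information cannot be fully reconstructed this way.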

The next version of the secondstage loader will contain a mode to recover your bad block list. I always wished we could avoid this (as it's out of spec to even touch bad sectors), but now it seems required. Until then, you have to live with ~600 KB less flash.
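For reference, the ~600 KB figure is consistent with 39 bad blocks if each erase block is 16 KiB. The thread does not state the erase block size, so 16 KiB is an assumption here (it is a common size for small-page NAND); the function and constant names are made up for the example.

```c
#include <assert.h>

/* Assumption: 16 KiB erase blocks. The block size is not given in the thread. */
enum { ASSUMED_ERASE_BLOCK_BYTES = 16 * 1024 };

/* Flash capacity that becomes unusable when whole blocks are flagged bad. */
unsigned long lost_flash_bytes(unsigned bad_blocks,
                               unsigned long erase_block_bytes)
{
    assert(erase_block_bytes > 0);
    return (unsigned long)bad_blocks * erase_block_bytes;
}
```

With these numbers, `lost_flash_bytes(39, ASSUMED_ERASE_BLOCK_BYTES)` comes to 638976 bytes, i.e. 624 KiB.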

The reason for the re-layout was to give more space to the kernel, as the MIPS kernel doesn't have an integrated gunzip, and the jffs2 doesn't give such a good compression ratio because of its relatively small block size. Plus, some more space for boot logos was quite nice, and some flashes had a bad block in the secondstage area, which had become quite tight with some new features.

Next, the secondstage loader has hardcoded partition sizes (my stupid "marker" thing never really worked out...). So it has to match.

dnuos · 13 February 2006

thanks tmbinc for the explanation.

noggie · 14 February 2006

Thanks, tmbinc, for your explanation.

I've been trying to come up with a solution for dnuos' flash, feeling bad for f*cking up his box. He's been incredibly sporty about it all (we've been in touch on PM) but I still would very much like to see his flash get back to what it used to be.

I think I have come up with a way to fix it, but this time around I would like to get some help verifying that my thinking is sound before asking anyone else to trust me to know what I'm talking about. Please, tmbinc, could you take a look at the following argument and see if you can spot any weaknesses?

1. Before restoring the backup, dnuos' flash showed no sign of bad blocks. The scan done by the backup program showed no bad blocks, and the "prova" utility showed no bad blocks either. Conclusion: The flash is 100% good, has no actual bad blocks, and the bad blocks that suddenly appeared are blocks that have been incorrectly marked as bad.

2. "Fixing" his bad blocks is a case of erasing the bad block flag for all blocks. In his case, there's no need to distinguish between really bad blocks and apparently bad blocks, since his flash was without any bad blocks at all before doing that fatal restore.

3. The kernel checks that no user program tries to erase a bad block. This takes place in the nand_base.c file in drivers/mtd/nand. Although disabling this check would be a very bad idea in the normal case, it should be possible using the following diff:

Diff

--- nand_base.c.orig    2005-06-17 21:48:29.000000000 +0200
+++ nand_base.c 2006-02-14 18:16:02.000000000 +0100
@@ -2050,12 +2050,14 @@
        instr->state = MTD_ERASING;


        while (len) {
+#if 0
                /* Check if we have a bad block, we do not erase bad blocks ! */
                if (nand_block_checkbad(mtd, ((loff_t) page) << this->page_shift, 0, allowbbt)) {
                        printk (KERN_WARNING "nand_erase: attempt to erase a bad block at page 0x%08x\n", page);
                        instr->state = MTD_ERASE_FAILED;
                        goto erase_exit;
                }
+#endif


                /* Invalidate the page cache, if we erase the block which contains
                   the current cached page */


4. The flash_eraseall utility in mtd-utils also contains code to skip bad blocks. Again, removing this code is not a good idea in the normal case, but it can be done using the following diff:

Diff

--- flash_eraseall.c.orig       2005-02-17 15:55:06.000000000 +0100
+++ flash_eraseall.c    2006-02-14 18:20:06.000000000 +0100
@@ -61,7 +61,11 @@
        mtd_info_t meminfo;
        int fd, clmpos = 0, clmlen = 8;
        erase_info_t erase;
+#if 0
        int isNAND, bbtest = 1;
+#else
+       int isNAND, bbtest = 0;
+#endif


5. By booting a kernel containing the above patch, and running off a root file system different from the internal flash, erasing the root partition with the modified flash_eraseall would also clear all bad block flags. In the normal case, doing this would be incredibly stupid, but in this particular case the bad block flags are reset to the correct value. BTW, only the root partition needs to be fixed, since all his incorrectly flagged bad blocks are on the root partition.

6. A reboot and flashing a new image concludes the fixup operation.

Can you see anything wrong with this? I've tested it as far as I can. I gave my own 7025 flash a couple of fake bad blocks (small test program, source code not provided, using the "MEMSETBADBLOCK" ioctl to set the bad block flag). The "bad" blocks were recognized as bad using all the utilities I had available. I then used a kernel patched as I described above, and a hacked flash_eraseall, booted off the CF, and successfully reset the bad block flag. But my box is a 7025, not a 7020 like dnuos', so I'm not 100% certain that it will also work on the 7020.
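For anyone who wants to check the result before and after such a fixup, a read-only scan along the following lines should be enough. This is not the test program mentioned above (its source was not posted); it is a minimal sketch that only queries the flags through the standard `MEMGETINFO` and `MEMGETBADBLOCK` ioctls from `<mtd/mtd-user.h>` and never erases or marks anything. The function name is made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <mtd/mtd-user.h>

/* Count the erase blocks that the kernel reports as bad on an open MTD
 * character device. Stores the total number of blocks in *total_blocks.
 * Returns the number of bad blocks, or -1 on any error (including
 * devices that do not support bad block queries). */
int count_bad_blocks(int fd, unsigned *total_blocks)
{
    mtd_info_t info;
    int bad = 0;

    assert(total_blocks != NULL);
    if (ioctl(fd, MEMGETINFO, &info) != 0 || info.erasesize == 0)
        return -1;

    *total_blocks = info.size / info.erasesize;
    for (uint64_t ofs = 0; ofs < info.size; ofs += info.erasesize) {
        long long offset = (long long)ofs; /* the ioctl takes a 64-bit offset */
        int ret = ioctl(fd, MEMGETBADBLOCK, &offset);
        if (ret < 0)
            return -1;
        if (ret > 0)
            bad++;
    }
    return bad;
}
```

It would be called on a file descriptor opened read-only on the MTD character device of the root partition; which device node that is depends on the box's partition layout. A count of 39 before the fixup and 0 afterwards would match what this thread expects.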

Any comments, anyone?