The Solution

2021-09-07

On my previous post is promised I’ll post when I have sorted out things. Well I haven’t, and thus there has not been a new post. Before this one.

Instead of this post telling about “The Solution”, it’s actually a debugging story which includes lot of small fixes and improvements.

The problem

I have had few issues for now. I created simple reproducer for those:

  1. Start micropython
  2. Write something (number being best) with newline
  3. Repeat until crash

First I did that manually, but apparently some of the changes I made reduced the reproducibility a lot. Thus I investigated what would the best way to automate this. And it end up being qemu with VNC, with a simple script with calls vncdo.

So first I run my kernel with all the debugging and monitor on:

kvm -smp 2 -cpu host -serial mon:stdio -kernel /home/roivas/projects/cxx-kernel/build93/kernel.bin -monitor telnet::45454,server,nowait -vnc :42

And then run the script, in it’s all glory:

#!/bin/bash

HOST="localhost"
PORT=":42"

while [ 1 ]; do
    vncdo -s "${HOST}${PORT}" type "${RANDOM}${RANDOM}${RANDOM}${RANDOM}${RANDOM}${RANDOM}${RANDOM}${RANDOM}"
    vncdo -s "${HOST}${PORT}" key enter
done

This writes random numbers until the kernel crashes. And yes, so far it has always crashed.

Where does it crash?

First location where it crashes is memcpy. That doesn’t tell that memcpy itself is faulty, but that something disturbs it. Maybe it get faulty values from the beginning?

I tried to print all the values out, and there was no errors. It just happens to create a paging fault. Then I also traced the source of the memcpy to my Bochs video driver, and it seems to call it always with predictable values. On the other hand if I disable graphics, the crash does not reproduce, or at least is way much harder to reproduce.

So it might have something to do with graphics and memcpy?

But the next thing is weird. It crashes in the memory management code. Once something frees the allocated memory, the chunk of memory is returned to a free list. Or at this case binary tree of freed chunks.

Idea is to keep track of all free chunks of memory in a binary tree. Searching proper sized chunk is fast, plus adding + removing is fairly fast. But it seems to crash in the treeAdd method. I traced it a lot, but could not find any bugs there. Instead what I got was clear sign that the tree was corrupted at some point. It’s most likely coming from outside of the code I was checking.

This all lead me to thing, that I most probably have leaky pointers, or some checks missing. I checked all the video code. All the possible places where it could go wrong. Did a lot debugging and tracing. I even implemented traceback.

Traceback

This is a simple feature, which tells where did the call came from. Some people calls it stack trace. Whatever you call it, the function is all the same: where you were when the crash happened, and how did you get there.

On x86 this needs one trick: keep track of the ebp stack and have support from compiler. This means that one every new call, you save you ebp. It’s like:

push ebp
mov ebp, esp
...
leave

Meaning of the leave instruction is to undo what you did in the beginning. Fixing this helped me to print out stack trace on error. Except it doesn’t always work.

And it caused some issues. Back in the days I complained that needed to fix one define to make Micropython to work on my kernel. It turned out that supporting ebp and -fno-omit-frame-pointer was kind of crucial.

There were few places where I just overwrote the ebp so those needed to be fixed. Unfortunately during the fixes I made some mistakes. I had something like mov eax, [esp + 4]. Problem is that since I push ebp to stack, all those indexes were wrong again. Because of this for example my optimized memcpy was not working, since I was unable to detect if MMX was supported on the CPU. That took me way too long to figure out.

But after all this I can have at least some kind of stack trace, but unfortunately sometimes the stack trace itself still caused some crashes.

Interrupts

One thing I spent some time to validate was interrupts and their handlers. I think I rewrote some of the code, and improved it in many ways. And there was actually potential bug, since how I passed the registers to the handler might have been faulty. Similar way I didn’t push enough values to the stack, causing possibly buffer overflows and all all kind of nasty things. Fortunately the stack was kept in balance, and buffer overflow was in worst case just accessing wrong information.

I think I gained bit more stability, but it didn’t fix the issue. I still think it might have something to do with interrupts. The issue only happens when there’s keypresses happening. Keypress triggers interrupt, which calls the keyboard handler. Thus it’s either the interrupts, or something in the keyboard handler. Or then it’s not there at all but that’s the way to trigger it.

I improved the locking in scheduling, and in the keyboard handler. Previously keyboard handler had lock only when reading from keybuffer, so it prevented multiple readers. I was thinking that it’s not possible to have multiple writers. But in SMP machine it might not be true, since it might be that two cores cause interrupts at the same time. Thus it’s crucial to use locking when accessing shared resources. My keyboard handler just wrote there, so I implemented locking on the write side as well. Again this improved the situation. That however caused this to be harder to reproduce. So I think either I’m fixing multiple issues now, and at some point it just disappears. Or then just all the locking changes the timing. Hard to say.

Some improvements

There’s some general improvements on the code as well. I learned that in x86 rep; nop actually instructs CPU to pause. This is generally much better than plain nop or busy loop. since it tells CPU to relax a bit.

There was fixes in the video code. I think I touched and checked quite many places. Still it misses the fix. This might not be the situation I wanted, but it’s improvement.

I know there’s still the bug. It’s really bothering me. But I know I’ll get it eventually. For now maybe I should celebrate the failure. After few weeks(!) of debugging and improvements, I have got forward. All the things I have learned, all the small things, and real fixes. if they would not have been fixed for now, they would be ahead of me later on.

And I could have written this blog post way much earlier.