JMC: Enter the Tubes - Part 2

3 January, 2024

This is part 2 in a series about implementing Japanese Military Chess for the BBC Micro. See part 1 here.

Enter the Tubes

On my mind from the start of this project was the potential to support running on the 6502 Second Processor. The BBC Micro has support for connecting to additional processors via a message passing interface called the Tube. A number of different additional processors were supported by the Tube, including the first ARM chips, but the most interesting option, to me, is a second 6502 processor clocked at a higher speed. The original second 6502 ran at 3 MHz, 50% faster than the main processor, and the FPGA-based second processor board in my BBC Master runs at up to 16 MHz. This isn’t going to be needed for most of the game, but if I add an AI player in the future then the option of a speedup might come in handy.

When the Tube is enabled, the second processor, or “parasite” as it is known, takes over the running of the machine. The MOS, the machine’s original operating system, continues to run on what is now called the host or I/O processor. Software running on the parasite can make OS calls and these are serialised across the Tube and executed on the host. Second processors with a different instruction set, such as a Z80 or ARM, need their own software, for example a ported version of BBC BASIC. However, a second 6502 processor can run exactly the same software as the host, only faster, and provided applications use only supported OS calls for their input and output then running over the Tube is seamless.

This presents a problem for games because, while the OS does provide a range of nice drawing primitives via its VDU system, it’s not very fast and rather deficient in its support for bitmapped graphics. Games with high quality animation invariably rely on writing to the screen memory directly as we did in part one of this series. This is not possible to do directly from a second processor.

Many games therefore don’t support running on a second 6502 processor, but one exception is the special second processor edition of Elite. Elite runs the majority of its game logic on the second processor, but it installs a kind of thin client onto the host processor and transmits the coordinates of graphics to be drawn to it over the Tube. We can do something similar.

Glorious Overlays

When I was little and the gap between my games programming ambitions and my games programming abilities was even larger than it is today, I used to marvel at all the different files that made up DOS-era computer games. If I couldn’t make the game of my dreams then at least I could imagine what array of cryptic 8.3 filenames it might be composed of. It is fitting therefore that we’re going to break this game down into lots of separate files, and even better that DFS file names are only 7 characters long.

One reason for doing this is so that different parts of the game can be loaded as needed in order to fit everything into the 32 KB memory of an unexpanded BBC Model B. Of that 32 KB, 17.5 KB is taken up by the reduced-height screen and at least another ~6 KB for the operating system (assuming DNFS), leaving at most 8.5 KB for the game. However, at this juncture, a more pressing reason is so that we can split the program into parts that run on the host processor or on the second processor depending on configuration.
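The memory budget above can be checked with a few lines of arithmetic. This is only a sketch using the figures quoted in the text; the constant and function names are made up for illustration.

```python
# Memory budget for an unexpanded BBC Model B, using the figures quoted above.
TOTAL_KB = 32.0
SCREEN_KB = 17.5  # reduced-height screen
OS_KB = 6.0       # approximate OS and filing system workspace, assuming DNFS


def remaining_kb(total, *reserved):
    """Return what is left of `total` after subtracting every reserved region."""
    return total - sum(reserved)


game_kb = remaining_kb(TOTAL_KB, SCREEN_KB, OS_KB)
print(f"Available for the game: {game_kb} KB")
```

Since the OS figure is a lower bound, the result is an upper bound on what the game can use.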

The main executable, JMC, is now just a loader which uses OSFILE $FF to load various other files into memory and then can itself be overwritten (although we’re not that tight for memory yet). The top-level of the program now lives in a file called _APP, and various parts of the graphics subsystem in _GH, _GTP, and _GTH. I’ll explain this arrangement in more detail shortly.

Incidentally, I made a mistake while writing the loader code and assumed I only needed to supply the first 7 bytes of the OSFILE parameter block because it didn’t use anything after the byte at offset 6. What I didn’t realise is that it writes back the loaded file’s metadata into the parameter block and it corrupted some data I’d placed too close after it. I think programmers of that era were much more accustomed to thinking of everything as potentially mutable than I am today.
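The pitfall is easier to see with the parameter block laid out byte by byte. The following is a rough Python model of an OSFILE load block, based on the layout given in the Advanced User Guide (18 bytes in total). It is not the loader’s actual code, and the function name and example addresses are purely illustrative.

```python
import struct

OSFILE_BLOCK_SIZE = 18  # the OS may write back into all 18 bytes


def osfile_load_block(filename_addr, load_addr):
    """Build an 18-byte OSFILE control block for a load (A = $FF)."""
    block = bytearray(OSFILE_BLOCK_SIZE)
    struct.pack_into("<H", block, 0, filename_addr)  # offsets 0-1: pointer to filename
    struct.pack_into("<I", block, 2, load_addr)      # offsets 2-5: load address
    block[6] = 0  # offset 6: zero means "load at the address given in this block"
    # Offsets 7-17 are not read on entry for a load, but they are overwritten
    # on return with the file's catalogue information, so the space must
    # still be reserved.
    return block


# Illustrative addresses only; they are not taken from the JMC source.
block = osfile_load_block(filename_addr=0x1A00, load_addr=0xFFFF2000)
print(len(block), block[:7].hex())
```

Only the first 7 bytes carry input for a load, but reserving all 18 keeps the write-back from landing on neighbouring data.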

Are we Tubed?

One of the things I love about programming the BBC Micro is that, ostensibly, you can develop a complete understanding of the system just by reading the Advanced User Guide and the New Advanced User Guide. It’s not always so easy these days, but the past is no panacea and there are plenty of gaps to be found in the old manuals.

Something that confused me when reading about the Tube was that I didn’t realise second processors always take control of the system. I knew this was the case for the 6502 second processor since I have one, but I thought that this was a special case and that others might be controlled from the host like an accelerator. It seemed odd therefore that the Tube API provided no way of determining the type of second processor attached, but in fact you do know because that’s always the processor you start from.

This misapprehension led me to invent an esoteric method of detecting the Tube. Rather than read the Tube presence flag with OSBYTE $EA, I use OSWORD $06 to write a byte into the host processor’s memory. If that write is visible in the memory of the processor I’m running on then clearly I’m not on a 6502 second processor, but if it’s not visible then it’s time to engage the Tube handling code. Well, it seems to work!
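The detection trick can be modelled in a few lines of Python. This is a toy simulation rather than the real 6502 code: the `Machine` class, the probe address and the marker value are all assumptions made up for the sketch.

```python
class Machine:
    """Toy model: host memory plus, optionally, separate parasite memory."""

    def __init__(self, has_second_processor):
        self.host_ram = bytearray(0x10000)
        # Without a Tube, our code runs on the host and shares its memory.
        if has_second_processor:
            self.local_ram = bytearray(0x10000)
        else:
            self.local_ram = self.host_ram

    def osword_06(self, address, value):
        """Stand-in for OSWORD $06: write one byte into host (I/O) memory."""
        self.host_ram[address] = value


def running_over_tube(machine, probe_addr=0x70, marker=0xA5):
    """Return True if a host-side write is NOT visible in local memory."""
    machine.local_ram[probe_addr] = marker ^ 0xFF  # known "before" value
    machine.osword_06(probe_addr, marker)          # write via the host
    return machine.local_ram[probe_addr] != marker


print(running_over_tube(Machine(has_second_processor=True)))
print(running_over_tube(Machine(has_second_processor=False)))
```

Seeding the probe location with a different value first guards against a stale byte that happens to match the marker.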

Tubing Time

In order to efficiently draw on the screen, we need to get some code running on the host processor from the second processor. The easiest way of doing that is by hooking USERV, a vector that receives any unrecognised OSWORD calls numbered $E0 and above on the host. We can load a USERV handler directly into host processor memory from disk using OSFILE $FF and install it by writing a word into the vector, one byte at a time, with calls to OSWORD $06. Hence, I assign OSWORDs to all the subroutines that need to be called and when I call one from the second processor the USERV handler will be invoked on the host:

OSWORD	Subroutine	Parameters
$E0	blit_front_to_screen	None.
$E1	slide_back_up	None.
$E2	slide_back_down	None.
$E3	slide_back_left	None.
$E4	slide_back_right	None.
$E5	prepare_front_buffer	Global variables related to window position and clear colour.
$E6	copy_front_to_back	None.
$E7	copy_back_to_front	None.
$E8	draw_sprites	Parameter block containing a list of bitmap pointers along with associated colours, positions, and dimensions.

Only two of the graphics subroutines take parameters. I modified the sprite drawing routine to take its parameters in the standard OSWORD format, with XY pointing to a length prefixed parameter block in memory. The other one, prepare_front_buffer, depends on several global variables specifying the window position, background colour, etc, and I take this opportunity to pass them over to the host in the OSWORD parameter block.
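Going back to installing the handler: because OSWORD $06 writes a single byte into host memory, pointing USERV at the handler takes two calls. The sketch below builds the two 5-byte parameter blocks in Python. USERV lives at $0200 in host memory; the handler address is hypothetical, and setting the top two address bytes to $FFFF is the usual convention for I/O processor addresses, which is assumed here rather than confirmed from the JMC source.

```python
USERV = 0x0200             # address of the user vector in host memory
IO_ADDRESS = 0xFFFF0000    # assumed convention for I/O processor addresses


def osword_06_block(address, value):
    """5-byte OSWORD $06 block: 32-bit little-endian address, then the byte."""
    return address.to_bytes(4, "little") + bytes([value])


def install_userv(handler_addr):
    """Return the two parameter blocks that point USERV at handler_addr."""
    lo, hi = handler_addr & 0xFF, (handler_addr >> 8) & 0xFF
    return [
        osword_06_block(IO_ADDRESS | USERV, lo),        # low byte of vector
        osword_06_block(IO_ADDRESS | (USERV + 1), hi),  # high byte of vector
    ]


# 0x0912 is a made-up handler address for illustration.
for blk in install_userv(0x0912):
    print(blk.hex())
```

On real hardware it matters that nothing calls through the vector between the two writes, since it is briefly half updated.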

The graphics code is now spread over three files:

_GH: The gfxlib_host module runs on the host processor irrespective of whether a second processor is present. It contains subroutines which perform the actual drawing into screen memory.
_GTP: The gfxlib_tube_parasite module lives at the same address in the second processor as gfxlib_host does in the host processor. It contains stubs for all the non-internal subroutines which translate them into OSWORD calls.
_GTH: The gfxlib_tube_host module contains the USERV handler and calls the right subroutine in gfxlib_host in response to trapped OSWORDs.

If we only have a single processor then the game calls gfxlib_host directly. If we have a second processor then the game calls gfxlib_tube_parasite and it seamlessly triggers a call into gfxlib_host on the host via the Tube.
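The division of labour between the three modules can be sketched as a dispatch table plus a set of stubs. This Python model only mirrors the roles described above: the OSWORD numbers and subroutine names come from the table, while everything else (the `calls` log, the helper names, the direct function call standing in for the Tube transfer) is invented for the illustration.

```python
calls = []  # records which host routines ran, for illustration only


def make_routine(name):
    """Create a stand-in for a gfxlib_host drawing routine."""
    def routine(params):
        calls.append((name, params))
    return routine


NAMES = ["blit_front_to_screen", "slide_back_up", "slide_back_down",
         "slide_back_left", "slide_back_right", "prepare_front_buffer",
         "copy_front_to_back", "copy_back_to_front", "draw_sprites"]

# OSWORD number -> gfxlib_host routine, following the table above.
GFXLIB_HOST = {0xE0 + i: make_routine(n) for i, n in enumerate(NAMES)}


def userv_handler(osword, params=None):
    """Role of gfxlib_tube_host: route a trapped OSWORD to gfxlib_host."""
    GFXLIB_HOST[osword](params)


def parasite_stub(osword):
    """Role of gfxlib_tube_parasite: turn a plain call into an OSWORD."""
    # The real transfer across the Tube is replaced by a direct call here.
    return lambda params=None: userv_handler(osword, params)


copy_front_to_back = parasite_stub(0xE6)
copy_front_to_back()
print(calls)
```

Because the stubs keep the same names as the real routines, the game code does not need to know which side of the Tube it is running on.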

Bad Tubes

After all this work, sprites started appearing on the screen with my second processor enabled. Hurray, it was working… mostly. The board displayed, and the selection cursor moved, but when I pressed the spacebar to animate a piece, the whole thing froze and became unresponsive. Still, it worked fine with the Tube disabled. What was going on?

I spent a lot of time in the B-Em debugger tracing through what was happening. It was clear that the program counter was ending up in the weeds. I could see that my custom OSWORD calls were being sent over the Tube and received on the host processor, until it suddenly just stopped working during the first frame of the piece animation.

The last successful graphics call to arrive on the host was copy_front_to_back. This is used to copy the graphics for underneath a sprite into the back buffer, so that it can be revealed as the sprite starts to move. The front and back buffers are stored in the language RAM area, at 0x400 and 0x500 respectively. This area is free for programs to use provided that they don’t intend to return to the language ROM, so there shouldn’t be a problem…

Ah, hold on a second, there’s something familiar sounding about those addresses. Something at the back of my mind from having digested the (New) Advanced User Guides. When using a second processor, the Tube host software is copied into the language RAM between 0x400 and 0x7FF. Whoops!

I guess the front buffer was close enough to the start of the Tube host software to only take out initialisation code, but writing into the back buffer was a bridge too far and stopped it from working. So, I moved the graphics buffers out of the language RAM and into general program memory, and then it all started working.
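A simple range check would have flagged the clash before any debugging was needed. In this sketch the buffer addresses and the Tube host range are the ones quoted above; the buffer size of one page and the relocated address are assumptions made for the example.

```python
TUBE_HOST = (0x400, 0x7FF)  # inclusive range used by the Tube host software


def overlaps(start, size, region):
    """Return True if [start, start + size) intersects the inclusive region."""
    lo, hi = region
    return start <= hi and start + size - 1 >= lo


BUFFER_SIZE = 0x100  # assumption: one 256-byte page per buffer

front_clash = overlaps(0x400, BUFFER_SIZE, TUBE_HOST)
back_clash = overlaps(0x500, BUFFER_SIZE, TUBE_HOST)
moved_clash = overlaps(0x2000, BUFFER_SIZE, TUBE_HOST)  # hypothetical new home

print(front_clash, back_clash, moved_clash)
```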

This got me thinking though: if the Tube host software needed the language RAM for itself then what else might it lay claim to? Surely, it will need some zero page for itself? According to an article on the memory map of second processor Elite, the Tube host software reserves zero page up to 0x7F for itself. Ouch, that only leaves 16 bytes before OS variables start at 0x90!

I had to squeeze a bit but fortunately a few less important variables could be moved out of zero page and into absolute addresses without too much penalty. Crisis averted.

Side Quest: Bug Hunt

I was showing off the previous tech demo to my friend Nick Chapman and he noticed that the sprite moves slightly slower when moving to the right than it does in the other directions. The obvious conclusion is that sometimes the graphics code is taking longer than a frame to finish drawing and so we’re skipping updating on some frames, but why only when moving right?

I ended up modifying B-Em to add a cycle counting feature and specifically a counter that tracks how long it has been since the start of the last vertical blank. Despite a lot of time spent single stepping in the debugger, I didn’t manage to conclusively capture it in the act, but a couple of things did become clear. Firstly, the graphics code was, at the very least, perilously close to exceeding the per-frame cycle budget. Secondly, the time taken by the keyboard scan code appears to vary based on which key is pressed.

The latter is not very surprising as the internal key code for the right arrow, 0x79, is much higher than for the left, down, and up keys (0x19, 0x29, and 0x39 respectively). Scanning the keyboard matrix is performed in software using the system VIA. However, this wouldn’t be an issue if the graphics routines weren’t cutting it so fine.

As touched upon in part 1, the performance of the sprite drawing routine varies depending on the horizontal alignment between bytes in the input bitmap and bytes in the output pixel format. The combination of the worst performing alignment and the more expensive keyboard scan appears to exceed the frame budget. When moving horizontally, the worst performing alignment will occur once in every four frames, and so the sprite was taking 25% more time to move rightwards than in other directions.
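The 25% figure follows from counting frames. The sketch below assumes that an overrunning step costs exactly one extra frame, which is a simplification; the function name is made up for the example.

```python
def relative_time(slow_every, frames_per_slow_step=2):
    """Time for `slow_every` steps when one of them overruns its frame,
    relative to the time taken when every step fits in a single frame."""
    frames = (slow_every - 1) + frames_per_slow_step
    return frames / slow_every


# One step in every four hits the worst alignment: 5 frames for 4 steps.
ratio = relative_time(4)
print(f"{(ratio - 1) * 100:.0f}% more time")
```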

Fortunately, I was able to find several opportunities to optimise the sprite drawing routine further. Whereas the worst case inner loop took 120 cycles in the previous version, it now only takes 105 cycles. I was also able to adjust the way the four bitmaps that make up a complete piece are drawn so that, for a given window position, they are spread over three of the four possible alignments. This reduces the impact of the worst case alignment on any one frame.

The cumulative effect of all this was to reduce the time spent drawing graphics by about 7,000 cycles, and the slowdown when moving rightwards went away.

Demo

Screenshot of JMC 0.0.2

Aside from adding support for the second processor, I also changed the functionality of the demo from the previous release. The main reason for this is that the ability to freely move sprites around the screen is not a requirement for the final game. We only need to be able to animate a piece moving from one space on the board to another. This is simpler because the spaces are located at byte-aligned positions on the screen and the graphics for the space underneath each piece are straightforward to draw.

In order to support freely moving the sprite while also redrawing the icons on the right-hand side in the previous version, I needed an additional buffer into which I could preserve a copy of the back buffer. I wanted to remove this hack before implementing Tube support because this would have needed to be accommodated in the OSWORD interface.

The new demo displays a red selection cursor that you can move either with the cursor keys or the more conventional Z/X/*/? keys. Pressing the spacebar triggers an animation of the piece moving from wherever it’s currently located to the selected space. The return key changes the icon as before.

Download JMC 0.0.2 disk image

Play JMC 0.0.2 with JsBeeb (Single Processor)

Play JMC 0.0.2 with JsBeeb (Second Processor)

Browse source code of JMC 0.0.2