Chapter 7: Spinning Up The Cores

Jim Carrey, depicting my life story in the movie Liar Liar.

// multicore stuff. Temporary until kernel_old gets set to 1

.equ CORE_1_LAUNCH_ADDRESS, 0x4000008C + (0x10 * 1)

.equ CORE_2_LAUNCH_ADDRESS, 0x4000008C + (0x10 * 2)

.equ CORE_3_LAUNCH_ADDRESS, 0x4000008C + (0x10 * 3)

// start up cores 1, 2, 3

mov32 r1, js2osCoreIdleFunc

mov32 r0, CORE_1_LAUNCH_ADDRESS

str r1, [r0]

mov32 r0, CORE_2_LAUNCH_ADDRESS

str r1, [r0]

mov32 r0, CORE_3_LAUNCH_ADDRESS

str r1, [r0]

// Bits 31:20 - Section base address

// Bit 19 - NS

// Bit 17 - NG

// Bit 16 - S

// Bit 15 - Access permissions bit 2 (AP[2])

// Bits 14:12 - TEX[2:0]

// Bits 11:10 - Access permissions bits 1:0 (AP[1:0]); with SCTLR.AFE set, AP[0] is the access flag

// Bits 8:5 - Domain num, [0..15]

// Bit 4 - XN (execute never); left at 0 here so code can run from these pages

// Bits 3:2 - Cacheable (3) / Bufferable (2)

// Bits 1:0 - Always 0b10 for a section page table entry / descriptor

// note: C bit only affects whether or not the cache is written to. Cache is always searched on reads.

// non device/peripheral pages are

// 0x0140E = 001 01 0 0000 0 11 10

// 0x1140E = 1 0 001 01 0 0000 0 11 10

// Bufferable and cacheable, AP[1:0] is 01, TEX is 001

// memory type is normal, outer and inner write-back, write-allocate

// 0x0140E is the non-shareable form; 0x1140E adds the S bit (bit 16) to make it shareable

// section base addresses cover [0..0x3EF00000]

mov r0, #0

mov32 r2, 0x1140E

init_next_table_entry:

// or in the address bits and write it out

orr r3, r2, r0, lsl #20

str r3, [r1], #4

add r0, r0, #1

cmp r0, #0x3F0 // peripherals start at 0x3F000000

blt init_next_table_entry

// step 3: ACTLR SMP bit

// Signals if the processor is taking part in coherency or not.

mrc p15, 0, r0, c1, c0, 1

orr r0, r0, #( 1 << 6 )

mcr p15, 0, r0, c1, c0, 1

.globl js2osInvalidateAllICacheLines

js2osInvalidateAllICacheLines:

mov r0, #0

mcr p15, 0, r0, c7, c5, 0

mov pc, lr

.globl js2osInvalidateAllDCacheLines

js2osInvalidateAllDCacheLines:

// get d$ info

mov r0, #0

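mcr p15, 2, r0, c0, c0, 0 // CSSELR: select the L1 data cache

isb // make sure the selection has taken effect before reading CCSIDR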

mrc p15, 1, r0, c0, c0, 0

// r0 = num sets - 1

movw r3, #0x1ff

and r0, r3, r0, lsr #13

mov r1, #0 // r1 = way loop counter

way_loop:

mov r3, #0 // r3 = set loop counter

set_loop:

mov r2, r1, lsl #30

orr r2, r2, r3, lsl #6 // r2 = set/way cache operation format (Cortex-A7 L1 D$ lines are 64 bytes, so the set index starts at bit 6)

mcr p15, 0, r2, c7, c6, 2

add r3, r3, #1

cmp r0, r3

bge set_loop // r0 = num sets - 1, so the last set has to be included

add r1, r1, #1

cmp r1, #4

bne way_loop

mov pc, lr

// step 5: turn on the MMU by setting the LSB in the control register

mrc p15, 0, r0, c1, c0, 0

mov32 r1, 0x73027827 // 01110011000000100111100000100111

mov32 r2, 0x20001827 // 00100000000000000001100000100111

and r0, r0, r1

orr r0, r0, r2

mcr p15, 0, r0, c1, c0, 0

// get multiprocessor affinity register, CPU id, and cluster id

mrc p15, 0, r0, c0, c0, 5

and r4, r0, #3

// atomic spinlock

// input:

// r0 - address of atomic

.globl js2osAtomicSpinLockAcquire

js2osAtomicSpinLockAcquire:

push {lr}

mov r1, #0x1 // load the lock taken value

js2osAtomicSpinLock_again:

dmb

ldrex r2, [r0] // load the lock value

cmp r2, #0 // is the lock free

strexeq r2, r1, [r0] // try and claim the lock

cmpeq r2, #0 // did this succeed?

bne js2osAtomicSpinLock_again

dmb

pop {pc}

// atomic spinlock

// input:

// r0 - address of atomic

.globl js2osAtomicSpinLockRelease

js2osAtomicSpinLockRelease:

push {lr}

mov r1, #0

dmb // make the critical section's accesses visible before the lock is released

str r1, [r0]

dmb

pop {pc}

Disclaimer: I do not work for Sony. Despite the disturbing percentage of my shirts, jackets, and bookbags that are PlayStation dev-related, I have never worked for Sony. I do, however, have many friends who work at Sony, some of whom I hope will call off the corporate lawyers. JayStation is in no way associated with Sony or PlayStation, and any stupid things I say represent only my own ineptitude and silliness.

If you haven’t guessed by now, here is where you find out I’m a fraud and have been deceiving you. This whole time I’ve been passing this project off as bare metal when it’s really not. When I first started I had even less of an idea what I was doing than I do now, and decided to take the advice of the excellent Cambridge RPI OS tutorials and piggyback off some of the config files Linux installs to the SD card. This becomes very important as we are about to go multicore.

I kid you not, this is all you have to do to start the cores. Write the address you want a core to jump to into a special mailbox register, and off it goes. This is the magic of piggybacking. Since real documentation for stuff like this is pretty much non-existent, these addresses were obtained by seeing what the Broadcom guys did in Linux (henceforth known as the WWLD strategy) and by scraping together info from the BCM2836 manual. If you’re interested in what the addresses really are, 0x4000008C + 0x10 * n is the mailbox 3 write-set register of core n, so the three used here are 0x4000009C, 0x400000AC, and 0x400000BC for cores 1, 2, and 3.
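The address arithmetic from the launch listing can be sketched in C as well. This is only an illustration: `core_launch_address` and `launch_core` are hypothetical helper names, not part of JayStation, and the mailbox pointer is passed in so the same code can be tried against an ordinary variable on a host machine.

```c
#include <assert.h>
#include <stdint.h>

/* Address pattern used in the listing: 0x4000008C + 0x10 * core. */
#define CORE_LAUNCH_BASE   0x4000008Cu
#define CORE_LAUNCH_STRIDE 0x10u

/* Hypothetical helper: where a secondary core's entry point gets written. */
static uint32_t core_launch_address(unsigned core)
{
    assert(core >= 1 && core <= 3);   /* core 0 is the one already running */
    return CORE_LAUNCH_BASE + CORE_LAUNCH_STRIDE * core;
}

/* Hypothetical helper: the launch itself is a single 32-bit store.
   On the target, `mailbox` would point at core_launch_address(n). */
static void launch_core(volatile uint32_t *mailbox, uint32_t entry_point)
{
    *mailbox = entry_point;
}
```

On the Pi the call would look something like `launch_core((volatile uint32_t *)(uintptr_t)core_launch_address(1), entry)`, with `entry` being the address of the idle function.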

Should we have gone Full Bare Metal Jacket, the process might go something like this. All cores would start up and be executing from the same start address. They would all set up some mode stacks, turn off the MMU, invalidate their respective L1 caches, and set up their page tables. Then each core would query its core number and, if it isn’t core 0, wait in a WFI while core 0 goes on to do some OS initialization before releasing the other cores from their wait. For a great example of how this is done, see chapter 13 of the Cortex-A Series Programmer’s Guide.
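The park-and-release part of that flow can be modelled on a host in a few lines of C. This is a rough sketch under my own assumptions, not how any particular firmware does it: `release_slot`, `release_core`, `poll_release`, and `example_idle` are all made-up names, and the barriers and wake-up instruction that real hardware needs appear only as comments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { NUM_CORES = 4 };

typedef void (*core_entry_fn)(void);

/* One slot per core; NULL means "stay parked". */
static core_entry_fn volatile release_slot[NUM_CORES];

/* Hypothetical stand-in for whatever a released core should run. */
static void example_idle(void) { }

/* Core 0: publish an entry point for a parked secondary core. */
static void release_core(unsigned core, core_entry_fn entry)
{
    assert(core >= 1 && core < NUM_CORES && entry != NULL);
    release_slot[core] = entry;
    /* On hardware: a barrier plus a wake-up event or interrupt goes here. */
}

/* Secondary core: one pass of the parking loop. Returns the entry point
   once core 0 has published it, or NULL if the core should keep waiting. */
static core_entry_fn poll_release(unsigned core)
{
    assert(core < NUM_CORES);
    return release_slot[core];
}
```

A parked core would call `poll_release` in a loop with its wait instruction between polls, and jump to the returned function as soon as it is non-NULL.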

While our method of releasing the cores is different, we still have to do a similar setup to the one above. If you need a refresher on MMUs, go see blog chapter 5a, because I’m going to try to stick to describing the changes alone. The primary change is that we now have to worry about coherency, making sure the cores have coherent views of data. This is controlled with the B (bufferable) and C (cacheable; it only affects writes, as the cache is always searched) bits, as well as the TEX bits and S (shareable). Technically it’s the S bit we’re interested in, but we do also want to turn on the D$.
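To see how the magic number 0x1140E falls out of those bits, here is a small C sketch that builds it from named fields. The macro and function names are mine, chosen for illustration; only the bit positions and the final value come from the page table listing.

```c
#include <assert.h>
#include <stdint.h>

/* Field positions in a 1 MB section descriptor, as listed in the comments. */
#define SECTION_TYPE    0x2u                    /* bits 1:0 = 0b10 */
#define SECTION_B       (1u << 2)               /* bufferable */
#define SECTION_C       (1u << 3)               /* cacheable */
#define SECTION_AP0     (1u << 10)              /* AP[0] */
#define SECTION_TEX(x)  ((uint32_t)(x) << 12)   /* TEX[2:0] */
#define SECTION_S       (1u << 16)              /* shareable */

/* Attributes for normal, shareable, write-back write-allocate memory. */
static uint32_t normal_section_attrs(void)
{
    return SECTION_S | SECTION_TEX(1) | SECTION_AP0 |
           SECTION_C | SECTION_B | SECTION_TYPE;
}

/* Descriptor for the 1 MB section starting at (mb_index << 20). */
static uint32_t section_descriptor(uint32_t mb_index, uint32_t attrs)
{
    assert(mb_index < 4096u);   /* 4 GB address space / 1 MB sections */
    return (mb_index << 20) | attrs;
}
```

Dropping `SECTION_S` from the mix gives 0x0140E, the non-shareable variant shown in the listing's comments.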

Note we don’t want to map peripheral memory or mailbox memory as cacheable, since every access to those addresses has to reach the hardware rather than a cached copy. Another thing we now have to do that we didn’t have to do before is to set the SMP (symmetric multiprocessing) bit. This signals that a core is taking part in coherency.
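One way to picture the split between normal and peripheral memory is a table fill like the one below. Treat it as a sketch: `NORMAL_ATTRS` is the value from the listing, but `DEVICE_ATTRS` is my own assumption (shareable device memory, execute-never, C bit clear), and mapping everything from 0x3F000000 upward as device memory is a simplification.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_SECTIONS          4096u     /* 4 GB / 1 MB sections */
#define FIRST_DEVICE_SECTION  0x3F0u    /* peripherals start at 0x3F000000 */
#define NORMAL_ATTRS          0x1140Eu  /* value used in the listing */
#define DEVICE_ATTRS          0x00416u  /* ASSUMPTION: AP[0]=1, XN=1, B=1, C=0 */

/* Fill a flat 1:1 section table: cacheable RAM below the peripheral
   window, uncached device memory from there up (mailboxes included). */
static void fill_section_table(uint32_t *table)
{
    assert(table != 0);
    for (uint32_t i = 0; i < NUM_SECTIONS; ++i) {
        uint32_t attrs = (i < FIRST_DEVICE_SECTION) ? NORMAL_ATTRS
                                                    : DEVICE_ATTRS;
        table[i] = (i << 20) | attrs;
    }
}
```

The entry for 0x40000000, where the core mailboxes live, ends up with its C bit clear, which is the property the paragraph above cares about.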

After that we should invalidate the caches. Note that the way to invalidate the d$ involves looping over sets and ways. The days of being able to do it via a single coprocessor instruction are over.
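The operand written for each set/way operation packs the way number into the top bits and the set number just above the line-offset bits. Rather than hard-coding those shifts, they can be derived from the cache size ID register. The C sketch below does that decoding on a host; the struct and function names are hypothetical, and the value used in the usage note is just an example encoding of a 32 KB, 4-way cache with 64-byte lines.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t num_sets;
    uint32_t num_ways;
    uint32_t set_shift;   /* log2(line size in bytes) */
    uint32_t way_shift;   /* 32 - ceil(log2(num_ways)) */
} cache_geometry;

/* Smallest number of bits able to index n items. */
static uint32_t ceil_log2(uint32_t n)
{
    uint32_t bits = 0;
    while ((1u << bits) < n)
        ++bits;
    return bits;
}

/* Decode a CCSIDR value: LineSize in bits 2:0, (ways - 1) in bits 12:3,
   (sets - 1) in bits 27:13. */
static cache_geometry decode_ccsidr(uint32_t ccsidr)
{
    cache_geometry g;
    g.set_shift = (ccsidr & 0x7u) + 4u;
    g.num_ways  = ((ccsidr >> 3) & 0x3FFu) + 1u;
    g.num_sets  = ((ccsidr >> 13) & 0x7FFFu) + 1u;
    g.way_shift = 32u - ceil_log2(g.num_ways);
    return g;
}

/* Operand for a level-1 set/way cache maintenance operation. */
static uint32_t set_way_operand(const cache_geometry *g,
                                uint32_t way, uint32_t set)
{
    assert(way < g->num_ways && set < g->num_sets);
    /* A direct-mapped cache has no way bits (and a shift of 32 is invalid). */
    uint32_t way_bits = (g->num_ways > 1u) ? (way << g->way_shift) : 0u;
    return way_bits | (set << g->set_shift);
}
```

For example, `decode_ccsidr(0x700FE01A)` describes 128 sets, 4 ways, and 64-byte lines, so the way number lands in bits 31:30 and the set number starts at bit 6.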

And finally we turn on the MMU by rewriting the system control register (the "step 5" listing).

I’m guessing a little clarification on those magic numbers might be useful. The full details are in the documentation for the system control register (SCTLR), but the quick summary is that we are turning on the MMU, alignment fault checking, the data and unified cache, the CP15 barrier operations, branch prediction, and the instruction cache, and telling the system that the translation table descriptor field AP[0] means access flag. That’s about all you need to do. Cores can find out which core number they are from the bottom two bits of the multiprocessor affinity register (MPIDR).
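If the hex is hard to read, the OR value can be rebuilt from named control register bits. The C sketch below does that and also shows the core number extraction. The macro and function names are mine; the two constants and the `& 3` come from the listings.

```c
#include <assert.h>
#include <stdint.h>

/* System control register bits set by the "step 5" listing. */
#define SCTLR_M        (1u << 0)    /* MMU enable */
#define SCTLR_A        (1u << 1)    /* alignment fault checking */
#define SCTLR_C        (1u << 2)    /* data and unified cache enable */
#define SCTLR_CP15BEN  (1u << 5)    /* CP15 barrier operations enable */
#define SCTLR_Z        (1u << 11)   /* branch prediction enable */
#define SCTLR_I        (1u << 12)   /* instruction cache enable */
#define SCTLR_AFE      (1u << 29)   /* AP[0] acts as the access flag */

#define SCTLR_KEEP_MASK 0x73027827u /* AND mask from the listing */
#define SCTLR_SET_BITS  (SCTLR_M | SCTLR_A | SCTLR_C | SCTLR_CP15BEN | \
                         SCTLR_Z | SCTLR_I | SCTLR_AFE)

/* Same read-modify-write as the listing: clear with the mask, then set. */
static uint32_t sctlr_enable_mmu(uint32_t old_value)
{
    /* Every bit we set must also survive the AND mask. */
    assert((SCTLR_SET_BITS & ~SCTLR_KEEP_MASK) == 0u);
    return (old_value & SCTLR_KEEP_MASK) | SCTLR_SET_BITS;
}

/* Core number = bottom two bits of the multiprocessor affinity register. */
static unsigned core_id_from_mpidr(uint32_t mpidr)
{
    return (unsigned)(mpidr & 3u);
}
```

`SCTLR_SET_BITS` works out to 0x20001827, the second magic number in the listing.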

One thing I do want to address is the subject of inner and outer cacheability. On this system, I believe both the L1 and L2 are considered inner domain since they are on-chip. Outer might be another processor’s cache in a big.LITTLE setup for example. This is why we won’t need to worry about the Snoop Control Unit.

And finally, as an added bonus, here is an atomic spinlock you can play with to implement basic locking. It’s not the best way of doing things, but for test purposes it gets the job done! Note DMB is data memory barrier and not a comment on how dumb this code is.
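For comparison, roughly the same lock can be written with C11 atomics and left to the compiler to turn into exclusive loads and stores plus barriers. This is a sketch of the idea rather than a drop-in replacement for the assembly; `js_spinlock` and the function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_uint state;   /* 0 = free, 1 = taken */
} js_spinlock;

/* Try once to move the lock from free to taken. Acquire ordering keeps
   the critical section from being reordered ahead of the lock. */
static bool spinlock_try_acquire(js_spinlock *lock)
{
    unsigned expected = 0u;
    return atomic_compare_exchange_strong_explicit(
        &lock->state, &expected, 1u,
        memory_order_acquire, memory_order_relaxed);
}

/* Spin until the lock is ours. */
static void spinlock_acquire(js_spinlock *lock)
{
    while (!spinlock_try_acquire(lock)) {
        /* busy-wait */
    }
}

/* Release ordering makes the critical section's writes visible before
   another core can observe the lock as free. */
static void spinlock_release(js_spinlock *lock)
{
    assert(atomic_load_explicit(&lock->state, memory_order_relaxed) == 1u);
    atomic_store_explicit(&lock->state, 0u, memory_order_release);
}
```

The acquire and release orderings play the role of the two DMBs: one after taking the lock, one before giving it back.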

That’s all for now. Since the cores are spinning and doing somewhat useful work, we should probably eventually talk about threads and schedulers. However, before going there I will take a brief detour and implement fibers as a testbed for saving and restoring contexts. Also because fibers are amazingly useful. There is that too.