JayStation2 Dev Blog

Chapter 9: FIRST TRIANGLE!!1!

No time to come up with clever captions. This is the most exciting blog post I've ever done and I've no time to waste.

Disclaimer: I do not work for Sony. Despite the disturbing percentage of my shirts, jackets, and bookbags that are PlayStation dev-related, I have never worked for Sony. I do, however, have many friends who work at Sony, some of whom I hope will call off the corporate lawyers. JayStation is in no way associated with Sony or PlayStation, and any stupid things I say represent only my own ineptitude and silliness.

No time for clever intros, we have a GPU to p0wn. At the highest level, this post is divided into four sections: initializing the GPU, creating the framebuffer, building command buffers, and misc info. Let's go!

Initializing the GPU is fairly simple, but totally necessary if you don't want to be reading 0xDEADBEEF from all the interesting registers. Initialization is done through the mailbox, passing the following data in. For reference, the interesting mailbox tags are all listed at https://github.com/raspberrypi/firmware/wiki/Mailbox-property-interface
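For anyone who'd rather see the init request as C than as a wall of .word directives, here is a minimal sketch. The function name and the capacity parameter are mine and purely illustrative. The tag IDs and the V3D clock ID are the values I believe the wiki above lists, so double check them there. The value sizes mirror the assembly listing in this post rather than anything authoritative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values, taken from my reading of the mailbox property wiki. */
#define TAG_SET_CLOCK_RATE 0x00038002u
#define TAG_ENABLE_QPU     0x00030012u
#define CLK_V3D_ID         5u
#define MBOX_REQUEST       0x00000000u

/* Hypothetical helper: fills buf with a two-tag property request
   (set the V3D clock, enable the QPUs) and returns its size in bytes.
   On real hardware the buffer must also be 16-byte aligned. */
size_t build_v3d_init_request(uint32_t *buf, size_t capacity_words, uint32_t v3d_hz)
{
    size_t i = 0;
    assert(capacity_words >= 12);

    buf[i++] = 0;                  /* total size in bytes, patched below */
    buf[i++] = MBOX_REQUEST;       /* request code */

    buf[i++] = TAG_SET_CLOCK_RATE; /* tag identifier */
    buf[i++] = 8;                  /* value buffer size in bytes */
    buf[i++] = 8;                  /* request indicator + value length */
    buf[i++] = CLK_V3D_ID;         /* which clock */
    buf[i++] = v3d_hz;             /* rate in Hz */

    buf[i++] = TAG_ENABLE_QPU;     /* tag identifier */
    buf[i++] = 4;                  /* value buffer size in bytes */
    buf[i++] = 4;                  /* request indicator + value length */
    buf[i++] = 1;                  /* 1 = enable */

    buf[i++] = 0;                  /* end tag */

    buf[0] = (uint32_t)(i * sizeof(uint32_t));
    return i * sizeof(uint32_t);
}
```

Calling build_v3d_init_request(buf, 16, 250000000u) should give you a 48-byte request with the size already written into the first word. Sending it through the mailbox is a separate step that this sketch leaves out.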

.align 4



     .word 0x00000000

     // Sequence Of Concatenated Tags

     // I am betting these don't work because we're using a fb negotiation-specific channel

     .word Get_Firmware_Revision

     .word 0x00000004

     .word 0x00000000


     .word 0

     .word Get_Board_Model

     .word 0x00000004

     .word 0x00000000


     .word 0

     .word Set_Physical_Display

     .word 0x00000008

     .word 0x00000000

     physical_display_x: .word SCREEN_X

     physical_display_y: .word SCREEN_Y

     .word Set_Virtual_Buffer

     .word 0x00000008

     .word 0x00000008

     virtual_display_x: .word SCREEN_X

     virtual_display_y: .word SCREEN_Y

     .word Set_Depth

     .word 0x00000004

     .word 0x00000004

     bits_per_pixel: .word BITS_PER_PIXEL

     .word Set_Virtual_Offset

     .word 0x00000008

     .word 0x00000008

     .word 0

     .word 0

     .word Allocate_Buffer

     .word 0x00000008

     .word 0x00000008

fb_ptr: .word 0

fb_size: .word 0

     // confirmation commands for my own "sanity"

     .word Get_Pitch

     .word 4

     .word 0

     pitch: .word 0

     .word Get_Depth

     .word 4

     .word 0

     depth: .word 0

     .word Get_Physical_Display

     .word 0x00000008

     .word 0x00000000

     disp_x: .word 0

     disp_y: .word 0

// 0x0 (End Tag)

     .word 0x00000000


.align 4 // 16

// Mailbox Property Interface Buffer Structure


     // Buffer Size In Bytes (Including The Header Values, The End Tag And Padding)


     // Buffer Request/Response Code

     // Request Codes: $00000000

     // Process Request Response Codes: $80000000 Request Successful, $80000001 Partial Response


     .word 0x00000000

     // Sequence Of Concatenated Tags

     // Tag Identifier

     .word Set_Clock_Rate

     // Value Buffer Size In Bytes

     .word 0x00000008

     // 1 bit (MSB) Request/Response Indicator (0=Request, 1=Response), 31 bits (LSB) Value Length In Bytes


     .word 0x00000008

     // Value Buffer (V3D Clock ID)

     .word CLK_V3D_ID

     // Value Buffer (250MHz)

     .word 250000000


     // Tag Identifier

     .word Enable_QPU

     // Value Buffer Size In Bytes

     .word 0x00000004

     // 1 bit (MSB) Request/Response Indicator (0=Request, 1=Response), 31 bits (LSB) Value Length In Bytes


     .word 0x00000004

     // Value Buffer (1 = Enable)

     .word 1

     // $0 (End Tag)

     .word 0x00000000


.equ NUM_TILES_X, 10

.equ NUM_TILES_Y, 8

// wanted format is

//          Tile_Coordinates x, y

//          Branch_To_Sub_List BIN_ADDRESS + ((y * 10 + x) * 32)

//          Store_Multi_Sample / Store_Multi_Sample_End


     .set countery, 0

     .rept NUM_TILES_Y

          .set counterx, 0

          .rept NUM_TILES_X

               Tile_Coordinates counterx, countery

               Branch_To_Sub_List BIN_ADDRESS + ((countery * NUM_TILES_X + counterx) * 32)

               // Last tile: store and flush. Every other tile: plain store.

               .if ( ( counterx == ( NUM_TILES_X - 1 ) ) && ( countery == ( NUM_TILES_Y - 1 ) ) )

                    Store_Multi_Sample_End

               .else

                    Store_Multi_Sample

               .endif

               .set counterx, counterx + 1

          .endr

          .set countery, countery + 1

     .endr

.align 2



     Clear_Colors 0xFF00FFFF, 0, 0, 0



     Tile_Rendering_Mode_Configuration 0x00000000, SCREEN_X, SCREEN_Y, Frame_Buffer_Color_Format_RGBA8888


     Tile_Coordinates 0, 0

     Store_Tile_Buffer_General 0, 0, 0 // Store Tile Buffer General (R)




.align 2


     Tile_Binning_Mode_Configuration BIN_ADDRESS, BIN_MEM_SIZE, BIN_BASE, NUM_TILES_X, NUM_TILES_Y, Auto_Init_Tile_State_Data



     Clip_Window 0, 0, SCREEN_X, SCREEN_Y

     Configuration_Bits Enable_Forward_Facing_Primitive + Enable_Reverse_Facing_Primitive, Early_Z_Updates_Enable

     Viewport_Offset 0, 0


     Vertex_Array_Primitives Mode_Triangles, 3, 0



.align 4 // 128-Bit Align


     // Flag Bits: 0 = Frag Shader Is Single Threaded, 1 = Point Size In Shaded Vert Data,

     // 2 = Enable Clipping, 3 = Clip Coordinates Header Included In Shaded Vertex Data

     .byte 0

     .byte 3 * 4 // Shaded Vertex Data Stride

     .byte 0 // Fragment Shader Number Of Uniforms (Not Used Currently)

     .byte 0 // Fragment Shader Number Of Varyings

     .word FRAGMENT_SHADER_CODE // Fragment Shader Code Address

     .word 0 // Fragment Shader Uniforms Address

     .word VERTEX_DATA // Shaded Vertex Data Address (128-Bit Aligned If Including Clip Coordinate Header)

.align 4 // 128-Bit Align


     // Vertex: Top

     .hword 320 * 16 // X In 12.4 Fixed Point

     .hword  32 * 16 // Y In 12.4 Fixed Point

     .single 0e1.0 // Z

     .single 0e1.0 // 1 / W


     // Vertex: Bottom Left

     .hword  32 * 16 // X In 12.4 Fixed Point

     .hword 448 * 16 // Y In 12.4 Fixed Point

     .single 0e1.0 // Z

     .single 0e1.0 // 1 / W


     // Vertex: Bottom Right

     .hword 608 * 16 // X In 12.4 Fixed Point

     .hword 448 * 16 // Y In 12.4 Fixed Point

     .single 0e1.0 // Z

     .single 0e1.0 // 1 / W

.align 4 // 128-Bit Align


     // Fill Color Shader

     .word 0x009E7000 ;

     .word 0x100009E7 // nop // nop // nop


     .word 0xFFFFFFFF // RGBA White

     .word 0xE0020BA7 // ldi tlbc, 0xFFFFFFFF

     .word 0x009E7000 ;

     .word 0x500009E7 // nop // nop // sbdone

     .word 0x009E7000 ;

     .word 0x300009E7 // nop // nop // thrend


     .word 0x009E7000 ;

     .word 0x100009E7 // nop // nop // nop

     .word 0x009E7000 ;

     .word 0x100009E7 // nop // nop // nop

address: 0x000117A8, command number: 0x00000073, 115_TILE_COORDINATES

    Tile Row Number (int8): 0x00000006

    Tile Column Number (int8): 0x00000001

address: 0x000117AB, command number: 0x00000011, 17_BRANCH_TO_SUB_LIST

    address: 0x00400200

address: 0x00400200, command number: 0x00000030, 48_COMPRESSED_PRIM_LIST

    jumping to compressed prim handler func: 0x0000A940

address: 0x00400202, command number: 0x00000038, 56_PRIM_LIST_FORMAT

    0,1,2,3 = Points, Lines, Triangles, RHT: 0x00000002

    1,3 = 16-bit index, 32-bit x/y: 0x00000001

    storing compressed prim handler func ptr: 0x0000A940

address: 0x00400204, command number: 0x00000041, 65_NV_SHADER_STATE

    Memory Address of Shader Record (in multiples of 16 bytes): 0x000119F0

address: 0x00400209, command number: 0x00000060, 96_CONFIG_BITS

    Early Z updates enable: 0x00000001

    Early Z enable: 0x00000000

    Z updates enable: 0x00000000

    Depth-Test Function (0-7 = never, lt, eq, le, gt, ne, ge, always): 0x00000000

    Coverage Read Mode (0,1 = Clear on read, Leave on read): 0x00000000

    Coverage Update Mode (0-3 = nonzero, odd, or, zero): 0x00000000

    Coverage Pipe Select: 0x00000000

    Rasteriser Oversample Mode (0,1,2,3 = none, 4x, 16x, Reserved): 0x00000000

    Coverage Read Type (0 = 4*8-bit level, 1 = 16-bit mask): 0x00000000

    Antialiased Points and Lines (not actually supported): 0x00000000

    Enable Depth Offset: 0x00000000

    Clockwise Primitives: 0x00000000

    Enable Reverse Facing Primitive: 0x00000001

    Enable Forward Facing Primitive: 0x00000001

address: 0x0040020D, command number: 0x00000067, 103_VIEWPORT_OFFSET

    Viewport Centre X-coordinate (sint16): 0x00000000

    Viewport Centre Y-coordinate (sint16): 0x00000000

address: 0x00400212, command number: 0x00000066, 102_CLIP_WINDOW

    Clip Window Left pixel coordinate (uint16): 0x00000000

    Clip Window Bottom pixel coordinate (uint16): 0x00000000

    Width (in pixels): 0x00000280

    Height (in tiles): 0x000001E0

address: 0x0040021B, command number: 0x00000030, 48_COMPRESSED_PRIM_LIST

address: 0x0040021C

    Relative branch, target: 0x00000045

address: 0x00400AC0

    Coding 2, 2fs complement difference between new tri index (1) and new tri index (0): 0x00000000

    Coding 2, 2fs complement difference between new tri index (2) and new tri index (0): 0x00000002

    Coding 2, Absolute new tri index (0): 0x00000000

address: 0x00400AC4

    Coding 1, 2's complement difference between new tri index (0) and prev tri index (0): 0x00000000

    Coding 1, 2's complement difference between new tri index (1) and prev tri index (1): 0x00000003

    Coding 1, 2's complement difference between new tri index (2) and prev tri index (2): 0x00000003

address: 0x00400AC6

    Coding 1, 2's complement difference between new tri index (0) and prev tri index (0): 0x00000000

    Coding 1, 2's complement difference between new tri index (1) and prev tri index (1): 0x00000003

    Coding 1, 2's complement difference between new tri index (2) and prev tri index (2): 0x00000003

address: 0x00400AC8

address: 000400AC9, command number: 0x00000012, 18_RETURN_FROM_SUB_LIST

address: 0x000117B0, command number: 0x00000018, 24_STORE_MULTI_TILE_COLOR_BUF

Creating the framebuffer is equally easy. It's another mailbox write, but this time the tags look something like this:

In this simplified example, I'm hardcoding the screen X, Y, and bits per pixel. In real life, when you pass these values into your init function and you've enabled the cache, you're going to need to flush the FB_STRUCT range to make the mailbox data visible to interested parties. Firmware revision and board model both currently come back as 0, so I need to look into why that is.
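Once the firmware has filled in fb_ptr and pitch, you still have to turn them into something the CPU can draw with. Here is a minimal C sketch of that step. The struct and function names are hypothetical. The address mask is an assumption on my part: as far as I know the returned pointer is a VideoCore bus address whose top two bits are alias bits, so treat that as something to verify on your board.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values the firmware writes back into the framebuffer request. */
typedef struct {
    uint32_t fb_ptr;   /* address returned by Allocate_Buffer */
    uint32_t fb_size;  /* size in bytes */
    uint32_t pitch;    /* bytes per scanline, from Get_Pitch */
    uint32_t depth;    /* bits per pixel, from Get_Depth */
} fb_info;

/* Assumption: the top two bits are bus alias bits and must be masked
   off before the ARM side dereferences the address. */
uint32_t fb_bus_to_arm(uint32_t bus_addr)
{
    return bus_addr & 0x3FFFFFFFu;
}

/* Byte offset of pixel (x, y) from the start of the framebuffer.
   Uses the reported pitch instead of width * bytes per pixel, since
   the firmware is free to pad each scanline. */
size_t fb_pixel_offset(const fb_info *fb, uint32_t x, uint32_t y)
{
    assert(fb->depth % 8 == 0);
    return (size_t)y * fb->pitch + (size_t)x * (fb->depth / 8);
}
```

With a 640 pixel wide, 32 bits per pixel buffer and no padding, the pitch is 2560 bytes, so pixel (3, 2) lands at byte offset 5132.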

This brings us to the real core of today's entry: command buffers. Even a high level overview has so much to cover, so this is going to have to be another multi-part post. Today I'm focusing on the front end's Primitive Tile Binner (PTB) and its two threads: binning and rendering. The binning thread is responsible for setting up the tile binning mode configuration, supplying blocks of binning memory for the render thread command buffers, and specifying state data, shaders, and primitive lists. The rendering thread then goes through the commands generated by the binning thread and... well... renders stuff.

Figure N: The binning thread command buffer allocates 32 byte blocks of commands and inline geometry for each tile. Then the render thread command buffer does a call and return to execute each tile's block. If more than 32 bytes are needed, the commands can contain jumps to other blocks.

The render thread goes through the following flow. For each tile XY, we add a Tile_Coords X, Y command. We then branch to the command list created for that particular tile by the binning thread. Finally we end with a Store_Tile command, and flush if it's the last tile. Doing this for 80 tiles is super tedious, so through the magic of GNU assembler macros, I proudly present the lazy way:

Even if you ignore the messy macro, at least look at the comment above showing the three commands the render thread must execute for every tile. The Branch_To_Sub_List command branches to the command buffer created for the tile by the binning thread. Using this macro, the final render command buffer will look like this:
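If assembler macros aren't your thing, here is a rough C sketch of the same per-tile loop, writing the records into a byte buffer. Opcodes 17, 24, and 115 are the ones that show up in my disassembler output. Opcode 25 for the store-and-end-of-frame variant and the column-then-row byte order are assumptions from my reading of the manual, and the function name is made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    OP_BRANCH_TO_SUB_LIST     = 17,
    OP_STORE_MULTI_SAMPLE     = 24,
    OP_STORE_MULTI_SAMPLE_END = 25,  /* assumption: store + end of frame */
    OP_TILE_COORDINATES       = 115,
    TILE_LIST_BLOCK_BYTES     = 32,  /* initial block per tile */
    TILE_RECORD_BYTES         = 9    /* 3 + 5 + 1 bytes per tile */
};

/* Hypothetical helper: emits Tile_Coordinates, Branch_To_Sub_List and a
   store command for every tile. Returns the number of bytes written. */
size_t emit_tile_records(uint8_t *out, size_t capacity, uint32_t bin_address,
                         unsigned tiles_x, unsigned tiles_y)
{
    size_t n = 0;
    assert(capacity >= (size_t)tiles_x * tiles_y * TILE_RECORD_BYTES);

    for (unsigned y = 0; y < tiles_y; y++) {
        for (unsigned x = 0; x < tiles_x; x++) {
            uint32_t sub = bin_address + (y * tiles_x + x) * TILE_LIST_BLOCK_BYTES;
            int last = (x == tiles_x - 1) && (y == tiles_y - 1);

            out[n++] = OP_TILE_COORDINATES;
            out[n++] = (uint8_t)x;            /* assumed order: column first */
            out[n++] = (uint8_t)y;            /* then row */

            out[n++] = OP_BRANCH_TO_SUB_LIST;
            out[n++] = (uint8_t)(sub);        /* little-endian address */
            out[n++] = (uint8_t)(sub >> 8);
            out[n++] = (uint8_t)(sub >> 16);
            out[n++] = (uint8_t)(sub >> 24);

            out[n++] = last ? OP_STORE_MULTI_SAMPLE_END : OP_STORE_MULTI_SAMPLE;
        }
    }
    return n;
}
```

For the 10 by 8 tile grid used here, that comes to 80 records of 9 bytes each, or 720 bytes of render command buffer.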

Ignoring the semaphore for now, we have a Clear_Colors command with the color 0xFF00FFFF, a Tile_Rendering_Mode_Configuration to set up the framebuffer address, screen dimensions, and color format, and finally our macro to generate all the per-tile branches. That 0x00000000 in Tile_Rendering_Mode_Configuration is for the framebuffer address, but since we don't know the address at assemble time, we have to patch it in after initializing the GPU and setting up the framebuffer. Again, don't forget to flush so the patched-in framebuffer address is visible to the PTB.
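The patching itself is tiny, but there is one gotcha worth a sketch: command buffer records are byte-packed, so the address field is generally not word aligned. A minimal C version, with hypothetical helper names, writes the address one byte at a time. The offset you pass in depends on what precedes the record in your own buffer, so I'm leaving it as a parameter rather than guessing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write a 32-bit little-endian value at an arbitrary byte offset.
   Byte stores avoid unaligned word accesses into the packed list. */
void patch_u32_le(uint8_t *list, size_t offset, uint32_t value)
{
    assert(list != NULL);
    list[offset + 0] = (uint8_t)(value & 0xFFu);
    list[offset + 1] = (uint8_t)((value >> 8) & 0xFFu);
    list[offset + 2] = (uint8_t)((value >> 16) & 0xFFu);
    list[offset + 3] = (uint8_t)((value >> 24) & 0xFFu);
}

/* Read the value back, handy for sanity checks before kicking the GPU. */
uint32_t read_u32_le(const uint8_t *list, size_t offset)
{
    assert(list != NULL);
    return (uint32_t)list[offset + 0]
         | ((uint32_t)list[offset + 1] << 8)
         | ((uint32_t)list[offset + 2] << 16)
         | ((uint32_t)list[offset + 3] << 24);
}
```

Assuming the address field sits right after the one-byte opcode, you would call patch_u32_le(record, 1, fb_address) and then flush the cache range covering the record.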

Now let's take a look at the binning thread command buffer format.

Most of this is pretty self-explanatory. We're setting some state, starting binning, and flushing when done. That semaphore only increments when binning is done and everything is flushed. The two most interesting things in this command buffer are the binning mode config and the NV shader state. BIN_ADDRESS and BIN_MEM_SIZE are the address and size of the memory pool the binning thread uses to create render command buffers. In my experiments, the allocation block size seems to be 32 bytes, and the remaining size can be read from the BMPRS register. Out of memory conditions can be handled with an interrupt.
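To get a feel for how much of that pool goes to the initial per-tile blocks, here is a small C sketch of the arithmetic. The 64 pixel tile size is an assumption that matches the 10 by 8 grid I use for 640x480, and the 32 byte block size is just what I observed, so neither is gospel. The function names are made up.

```c
#include <assert.h>
#include <stdint.h>

enum {
    TILE_SIZE_PIXELS = 64,  /* assumption: matches 10 x 8 tiles at 640x480 */
    BIN_BLOCK_BYTES  = 32   /* observed initial allocation per tile */
};

/* Number of tiles needed to cover a dimension, rounding up. */
uint32_t tiles_for(uint32_t pixels)
{
    assert(pixels > 0);
    return (pixels + TILE_SIZE_PIXELS - 1) / TILE_SIZE_PIXELS;
}

/* Bytes of binning memory consumed by the first block of every tile. */
uint32_t initial_tile_list_bytes(uint32_t width, uint32_t height)
{
    return tiles_for(width) * tiles_for(height) * BIN_BLOCK_BYTES;
}
```

At 640x480 this gives 10 by 8 tiles and 2560 bytes for the initial blocks. Anything beyond that comes out of the rest of the pool as tiles overflow their first block.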

The NV in NV_Shader_State has nothing to do with Nvidia, but rather means something like no vertex. The chip has three pipeline modes:

0) GL is your normal vert+frag thing

1) NV mode has no vert shader and uses pre-shaded vertices stored in memory

2) VG mode where vertices are supplied directly from the input primitive list as XY coordinates only

Wanting to get something on screen sooner rather than later, I went with NV mode, and will have to figure out how those pesky interpolants work later. Moving on to the NV shader state
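Here is how I'd picture the NV shader state record and the pre-shaded vertex layout as C structs. This is a sketch that mirrors the assembly listing in this post, with type and field names of my own invention. The static assertions assume a typical ABI where these structs need no padding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One NV shader state record: 4 config bytes followed by 3 addresses. */
typedef struct {
    uint8_t  flags;                /* bit 0 = frag shader single threaded, etc. */
    uint8_t  vertex_stride;        /* shaded vertex data stride in bytes */
    uint8_t  num_uniforms;         /* fragment shader uniform count */
    uint8_t  num_varyings;         /* fragment shader varying count */
    uint32_t frag_shader_code;     /* fragment shader code address */
    uint32_t frag_shader_uniforms; /* fragment shader uniforms address */
    uint32_t vertex_data;          /* shaded vertex data address */
} nv_shader_state;

/* One pre-shaded vertex: screen-space XY in 12.4 fixed point, then Z, 1/W. */
typedef struct {
    int16_t x;
    int16_t y;
    float   z;
    float   inv_w;
} nv_vertex;

_Static_assert(sizeof(nv_shader_state) == 16, "record should be 16 bytes");
_Static_assert(sizeof(nv_vertex) == 12, "stride should be 3 * 4 bytes");

/* Convert a pixel coordinate to 12.4 fixed point (truncates toward zero). */
int16_t to_fixed_12_4(float pixels)
{
    assert(pixels > -2048.0f && pixels < 2048.0f);
    return (int16_t)(pixels * 16.0f);
}
```

The top vertex of my triangle at (320, 32) becomes x = 5120 and y = 512 in this format, the same numbers as the 320 * 16 and 32 * 16 in the listing.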

Credit where credit is due, my vertex data and frag shader came from Peter Lemon's excellent RPI examples. Apparently the dude is hand assembling shaders. By hand. I've done a few myself and it's quite fun, but eventually for the initial release of libJNM I'm going to have to write an assembler. Or a compiler (JSSL or Particle).

Great, so now we have built two command buffers. How do we submit them? The GPU has two registers per thread, one for the command buffer's current address (which you set to the start of the buffer) and another for the end address, and execution continues as long as the current address and end address are not equal. For binning thread 0, these are CT0CA and CT0EA respectively. Likewise for render thread 1, they're CT1CA and CT1EA.
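In C, kicking a thread boils down to two register writes. The sketch below takes the register block as a pointer so it can be pointed at a plain array for testing. The byte offsets are what I believe the manual lists for these registers, so treat them as assumptions to verify, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed byte offsets from the V3D register base. */
enum {
    V3D_CT0EA = 0x108,  /* thread 0 (binning) end address */
    V3D_CT1EA = 0x10C,  /* thread 1 (render) end address */
    V3D_CT0CA = 0x110,  /* thread 0 (binning) current address */
    V3D_CT1CA = 0x114   /* thread 1 (render) current address */
};

/* Hypothetical helper: point a thread at [start, end) and let it run.
   The thread executes until its current address reaches the end address. */
void v3d_submit(volatile uint32_t *v3d, unsigned thread, uint32_t start, uint32_t end)
{
    assert(thread < 2);
    assert(start != end);  /* equal addresses would mean nothing to execute */

    v3d[(thread ? V3D_CT1CA : V3D_CT0CA) / 4] = start;
    v3d[(thread ? V3D_CT1EA : V3D_CT0EA) / 4] = end;
}
```

On hardware you would pass the mapped V3D base address, and both addresses must be ones the GPU can see, with the command buffer already flushed from the CPU cache.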

There are a few ways of synchronizing the threads. If you're lazy you can have the CPU wait on BMFCT, which is incremented when the binning thread flushes all tile lists to memory. When the count is right you can let the CPU go on to kick the rendering thread. On the other side, RMFCT is incremented whenever the last tile store is completed. Making the CPU spin wait on these is probably a very bad idea for performance. Slightly better is kicking the rendering thread when a binning flush interrupt happens. An even better way is to use semaphores (see above). There seem to be two front end semaphores, one that the render thread waits on and the binning thread increments, and another that the binning thread waits on and the render thread increments. This is a great way to stop either thread from getting too far ahead of the other. There are also markers, but that's a topic for another day.
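For completeness, the lazy spin wait looks roughly like this in C. It is a sketch with a made-up function name, and it is bounded so a wedged GPU doesn't hang the CPU forever. I'm assuming the count lives in the low 8 bits of the register, which is how I read the manual.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Poll a flush/frame counter register until its low 8 bits equal
   `expected`, giving up after max_polls reads.
   Returns true if the count was reached, false on timeout. */
bool wait_for_count(volatile const uint32_t *counter_reg, uint8_t expected,
                    uint32_t max_polls)
{
    assert(counter_reg != NULL);

    for (uint32_t i = 0; i < max_polls; i++) {
        if ((uint8_t)(*counter_reg & 0xFFu) == expected)
            return true;
    }
    return false;
}
```

You would call this on the binning flush count before kicking the render thread, and again on the render frame count before touching the framebuffer. It works, but it burns CPU the whole time, which is why the interrupt and semaphore routes are the better options.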

Finally, since we discussed the command buffers we ourselves write, I wanted to quickly explain the command buffers generated by the binning thread that our render command buffer calls. This was a bit of a mystery so I wrote a command buffer disassembler to get a better idea of what's in there.

To avoid making you read all that, I have color-coded the interesting bits in green and red. The green parts reflect exactly what was said towards the middle of the post: that the rendering thread, for each tile, must specify a tile XY with the Tile_Coordinates command, followed by a jump to the generated command buffer, a return from that command buffer, and a tile store. The red bits are interesting because they answer what happens when you need more than 32 bytes. The above example shows three triangles converted into an indexed list of compressed primitives by the PTB. Compressed primitive lists can contain not only 16+16 bit XY verts and 16/24 bit indices, but also jumps to other locations. The 0x45 branch target specified is a PC-relative offset, and is a multiple of 32 bytes. The escape code 0x80 specifies the termination of the primitive list. For a full list of compressed primitive codings and a full list of PTB render/binning thread commands, see the VideoCore manual at https://www.broadcom.com/docs/support/videocore/VideoCoreIV-AG100-R.pdf. This blog post was meant to be a high level overview, so make sure to RTFM if you want to know what all my setup commands mean.

So what's next? First, I'd like to turn my extremely elementary understanding of the chip into an elementary understanding of the chip. Maybe do a deep dive on the GL pipeline and the QPU. Also, once I get a better handle on things, I have to decide whether to be realistic and write a driver that fits the architecture well, or do I go crazy and challenge myself to implement my favorite GCN features in really hacky ways because I can? Maybe a mix of both. That's the fun of home projects where you don't have to be responsible :)

Shoutouts and props handed out to the following people:

Andrew, Colin, and Neil from Codeplay, Graham Wihlidal (GrahamBox system architect), Peter Lemon whose examples and header files saved me hours of manually typing MMIO register offsets, and all the poor guinea pigs who had to proofread this: Tom Forsyth (@tom_forsyth), @RapidGS, Jason Proctor