Writing a custom vector graphics renderer - part 3

Martin Fouilleul  —  1 month ago [Edited 2 hours, 12 minutes later]
Hello !

Here's the third part of my description of Top !'s vector graphics renderer. It will discuss the practical aspects of passing data down to the GPU with OpenGL, and our attempts at optimizing the renderer's performance. (It will not explain all the steps needed to set up and manage OpenGL, so some familiarity with the OpenGL API and GLSL may be required.)

In Part 1 we saw how paths were built with GraphicsMoveTo(), GraphicsLineTo() and GraphicsCurveTo() functions, and how path stroking was implemented in GraphicsStroke().
In Part 2 we discussed polygon triangulation algorithms and how they were used to fill a path with GraphicsFill().
Our vector graphics system also exposes functions to create fonts and gradients, to apply transformations, and to limit drawing to a given area defined by a clipping path. We briefly discuss them in the following paragraphs before addressing the OpenGL code.

Transformations

The graphics context maintains a stack of transformation matrices. The matrix on top of the stack is used to translate/scale/rotate paths. The two following functions are used to push/pop matrices on the stack :

void GraphicsPushMatrix(GraphicsContext* context, const Matrix3& matrix);
void GraphicsPopMatrix(GraphicsContext* context);


Note that GraphicsPushMatrix() multiplies its matrix argument with the current (ie. top of the stack) matrix before pushing it on the stack.
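To make the pre-multiplication semantics concrete, here is a minimal CPU-side sketch of such a matrix stack. The Mat3 type and names here are illustrative, not the engine's :

```cpp
#include <array>
#include <vector>
#include <cassert>

// Hypothetical row-major 3x3 matrix, for column vectors p' = M * p.
struct Mat3 {
	std::array<float, 9> m;
	static Mat3 Identity() { return {{1,0,0, 0,1,0, 0,0,1}}; }
	Mat3 operator*(const Mat3& o) const {
		Mat3 r{};
		for(int i = 0; i < 3; i++)
			for(int j = 0; j < 3; j++) {
				float s = 0;
				for(int k = 0; k < 3; k++) s += m[i*3+k] * o.m[k*3+j];
				r.m[i*3+j] = s;
			}
		return r;
	}
};

struct MatrixStack {
	std::vector<Mat3> stack{Mat3::Identity()};
	// Push composes with the current top, as GraphicsPushMatrix() does.
	void Push(const Mat3& m) { stack.push_back(stack.back() * m); }
	void Pop() { if(stack.size() > 1) stack.pop_back(); }
	const Mat3& Top() const { return stack.back(); }
};

// Transform a 2D point by the top matrix (homogeneous w = 1).
inline void Transform(const Mat3& t, float x, float y, float* ox, float* oy) {
	*ox = t.m[0]*x + t.m[1]*y + t.m[2];
	*oy = t.m[3]*x + t.m[4]*y + t.m[5];
}
```

Because each push composes with the previous top, popping always restores the exact transform that was current before the push.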
Vertex transformation occurs in our shader, so we use a uniform to pass our current transformation matrix to the shader before drawing, with the help of the following function:

void GraphicsSetShaderTransform(GraphicsContext* context, const Matrix3& t3)
{
	Matrix4 t4 = Matrix4(t3(1,1), t3(1,2), 0, t3(1,3),
			     t3(2,1), t3(2,2), 0, t3(2,3),
			     t3(3,1), t3(3,2), 1, 0,
			     0	    , 0      , 0, 1);

	Matrix4 m4 = context->coordSystem*t4;

	glUniformMatrix4fv(context->transformLoc, 1, true, m4.m_data);
}


coordSystem is a matrix that is set up to transform from window coordinates (with origin at the top-left corner, y pointing downward) to OpenGL normalized device coordinates.
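For reference, such a window-to-NDC matrix can be sketched as follows (the construction is mine, not necessarily the engine's exact coordSystem; row-major, for column vectors) :

```cpp
#include <array>
#include <cassert>

// Maps window coordinates (origin top-left, y down) to OpenGL NDC
// (origin at center, y up): x in [0,w] -> [-1,1], y in [0,h] -> [1,-1].
inline std::array<float, 16> WindowToNDC(float width, float height) {
	return {
		2.0f/width,  0,           0, -1,
		0,          -2.0f/height, 0,  1,   // the y flip
		0,           0,           1,  0,
		0,           0,           0,  1,
	};
}

// Apply the affine part of the matrix to a 2D point.
inline void Apply(const std::array<float,16>& m, float x, float y, float* ox, float* oy) {
	*ox = m[0]*x + m[1]*y + m[3];
	*oy = m[4]*x + m[5]*y + m[7];
}
```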

Fonts

As if I wasn't already in debt to Sir Sean Barrett after so many hours playing Thief: The Dark Project in my young days, it turns out that stb libraries are just awesome : public domain, header only, easy to integrate and use, entirely readable and transparent, and giving you total freedom over data and memory management.

It was a breeze to get font textures and metrics working, and to write a naive text rendering function using textured quads. I could (and will) put some more work here, regarding font scaling, kerning, etc. but that's really all there is to it for now : you get a big alpha texture with all your glyphs at a given size, and you render text by drawing quads with the corresponding UV coordinates, using the alpha value of the texel to multiply the alpha value of the current color.
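The quad generation boils down to this kind of sketch (types and the pixel-space conventions here are illustrative; a real version would use the glyph metrics stb_truetype gives you for placement and advance) :

```cpp
#include <cassert>

struct Vec2 { float x, y; };
struct GlyphQuad { Vec2 pos[6]; Vec2 uv[6]; };

// Two triangles covering the glyph's rectangle, with UVs normalized by the
// atlas dimensions so they address the glyph's area in the font texture.
inline GlyphQuad MakeGlyphQuad(float penX, float penY,
                               float gx, float gy, float gw, float gh,  // glyph rect in atlas pixels
                               float atlasW, float atlasH)
{
	float x0 = penX, y0 = penY, x1 = penX + gw, y1 = penY + gh;
	float u0 = gx/atlasW, v0 = gy/atlasH;
	float u1 = (gx+gw)/atlasW, v1 = (gy+gh)/atlasH;
	GlyphQuad q = {
		{{x0,y0},{x1,y0},{x1,y1},  {x0,y0},{x1,y1},{x0,y1}},
		{{u0,v0},{u1,v0},{u1,v1},  {u0,v0},{u1,v1},{u0,v1}},
	};
	return q;
}
```

The fragment shader then samples the alpha texture at the interpolated UV and multiplies it into the current color's alpha, as described above.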

Gradients

We saw in Part 2 that the filling algorithm computed UV coordinates corresponding to each triangle vertex, which allows us to fill a path not only with a solid color, but also with a texture. The result is as if we scaled the texture image to fit the bounding rectangle of our path, and drew the image pixels only where they fall inside the polygon defined by the path. This is used mostly with gradient textures :

GraphicsGradient* GraphicsGradientCreateLinear(GraphicsContext* context,
					       const Vector2& v,
					       const Color& start,
					       const Color& stop);

void GraphicsSetGradient(GraphicsContext* context, GraphicsGradient* gradient);


The function GraphicsGradientCreateLinear() creates a texture image corresponding to a linear gradient from a start color to a stop color along a given vector. We can then set this texture current to the graphics context by using GraphicsSetGradient(), so that it will be used in subsequent calls of GraphicsFill().
We will see later the implementation of GraphicsGradientCreateLinear(), once we have described the way we handle textures.

Clipping

Here's one of the features whose implementation I'm not satisfied with at the moment, for reasons we will see in a few paragraphs, but here it is : we need a way to restrict drawing to arbitrary (ie. path-defined) regions of the window, in order to ease the drawing of nested views, scrollable areas, etc. by the GUI system.
As such, we define a clip region, which initially covers the entire window, and can be restricted or expanded by the user's drawing code. We expose two functions for this purpose :

void GraphicsPushClip(GraphicsContext* context);
void GraphicsPopClip(GraphicsContext* context);


The function GraphicsPushClip() restricts the clip region to the intersection of the current path with the current clip region, whereas GraphicsPopClip() reverts the clip region to its previous state.

The easy way to do this is to use the OpenGL stencil buffer. During normal drawing operations, the stencil test is set to pass where the stencil value is zero and fail otherwise.
In GraphicsPushClip(), we first increment all values in the stencil buffer. Then we temporarily set the stencil test to fail on ones and pass otherwise, and the stencil operation to zero the stencil value upon failure. We then fill the current path, thus zeroing the intersection of the previous clip region (where values are now all ones) and the current path.



Fig. 1 : Pushing a clip path

To pop the clipping, we only have to decrement all values in the stencil buffer (without wrapping, ie. zeros will not be modified).



Fig. 2 : Popping a clip path
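The increment/zero/decrement dance can be checked with a small CPU simulation of the stencil buffer. Names here are hypothetical, and "filling the path" is abstracted into a predicate :

```cpp
#include <vector>
#include <cstdint>
#include <cassert>

// Value 0 means "drawable". PushClip increments everything, then zeroes the
// pixels that are both inside the new clip path and were drawable before
// (value 1), mirroring GL_NOTEQUAL 1 + GL_ZERO on stencil fail.
// PopClip decrements with clamping at zero, like GL_DECR without wrapping.
struct StencilSim {
	int w, h;
	std::vector<uint8_t> s;
	StencilSim(int w, int h) : w(w), h(h), s(w*h, 0) {}

	template<typename InsidePath>
	void PushClip(InsidePath inside) {
		for(auto& v : s) v++;                      // increment pass
		for(int y = 0; y < h; y++)
			for(int x = 0; x < w; x++)
				if(s[y*w+x] == 1 && inside(x, y))  // old region AND new path
					s[y*w+x] = 0;
	}
	void PopClip() {
		for(auto& v : s) if(v > 0) v--;            // decrement, no wrapping
	}
	bool Drawable(int x, int y) const { return s[y*w+x] == 0; }
};
```

Nesting two clips and popping them shows the region being restricted to the intersection, then restored step by step.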

Here is the definition of GraphicsPushClip() and GraphicsPopClip() :

void GraphicsPushClip(GraphicsContext* context, graphics_path* path)
{
	glDisable(GL_DEPTH_TEST);
	glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
	glDepthMask(GL_FALSE);

	glStencilFunc(GL_NEVER, 0, 0xFF);
	glStencilOp(GL_INCR, GL_KEEP, GL_KEEP);
	glStencilMask(0xFF);

	vertex_attributes attr[] = {{Point3H(-1, 1, 0), 0, 0, SHADER_MODE_COLOR},
				    {Point3H(-1, -1, 0), 0, 0, SHADER_MODE_COLOR},
				    {Point3H(1, -1, 0), 0, 0, SHADER_MODE_COLOR},
				    {Point3H(-1, 1, 0), 0, 0, SHADER_MODE_COLOR},
				    {Point3H(1, -1, 0), 0, 0, SHADER_MODE_COLOR},
				    {Point3H(1, 1, 0), 0, 0, SHADER_MODE_COLOR}};

	glUniformMatrix4fv(context->transformLoc, 1, true, Matrix4::Id().m_data);

	GraphicsGLPushVertexData(context, 6, attr);
	GraphicsGLDrawTriangles(context);

 	//NOTE(martin): at this point we have incremented all the stencil buffer
	//		we do another pass to zero the intersection of the ones and our path

	glStencilFunc(GL_NOTEQUAL, 1, 0xff);
	glStencilOp(GL_ZERO, GL_KEEP, GL_KEEP);

	GraphicsGLFillPath(context, path);

	//NOTE(martin): reset shader transform
	GraphicsGLSetShaderTransform(context, context->transform);
	GraphicsGLDrawTriangles(context);

	//NOTE(martin): now we reset masks and stencil test :

	glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
	glDepthMask(GL_TRUE);
	glEnable(GL_DEPTH_TEST);
	glStencilMask(0x00);
	glStencilFunc(GL_EQUAL, 0, 0xff);
}

void GraphicsPopClip(GraphicsContext* context)
{
	glStencilMask(0xff);
	glDisable(GL_DEPTH_TEST);
	glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
	glDepthMask(GL_FALSE);

	glStencilFunc(GL_NEVER, 0, 0xFF);
	glStencilOp(GL_DECR, GL_KEEP, GL_KEEP);

	vertex_attributes attr[] = {{Point3H(-1, 1, 0), 0, 0, SHADER_MODE_COLOR},
				    {Point3H(-1, -1, 0), 0, 0, SHADER_MODE_COLOR},
				    {Point3H(1, -1, 0), 0, 0, SHADER_MODE_COLOR},
				    {Point3H(-1, 1, 0), 0, 0, SHADER_MODE_COLOR},
				    {Point3H(1, -1, 0), 0, 0, SHADER_MODE_COLOR},
				    {Point3H(1, 1, 0), 0, 0, SHADER_MODE_COLOR}};

	glUniformMatrix4fv(context->transformLoc, 1, true, Matrix4::Id().m_data);

	GraphicsGLPushVertexData(context, 6, attr);
	GraphicsGLDrawTriangles(context);

	//NOTE(martin): reset shader transform
	GraphicsGLSetShaderTransform(context, context->transform);

	//NOTE(martin): Reset masks and stencil test :

	glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
	glDepthMask(GL_TRUE);
	glEnable(GL_DEPTH_TEST);
	glStencilMask(0x00);
	glStencilFunc(GL_EQUAL, 0, 0xff);
}


While this implementation is relatively straightforward, we will see in a moment why it is far from ideal in our practical use case.


First naive approach and performance problems


In a very first implementation of the renderer, we would pass vertex data and request OpenGL to draw it as soon as our stroking/filling functions had computed the needed triangles. That approach interleaves the immediate mode graphical interface logic with the drawing logic, and is not very efficient in cases where we realize after the fact that we didn't need to draw some portions of the GUI. That's why, instead of directly calling our drawing code, we merely record our drawing requests into a command buffer, and actually call our drawing code when the context is flushed by a call to GraphicsFlush().

For instance, calling GraphicsStroke() will not immediately compute triangles and send them to the GPU, but rather append a command to a buffer. When GraphicsFlush() is called, it will loop through the entire buffer and eventually call GraphicsGLStroke() which is the name of the actual stroking function we discussed in Part 1.

This way we avoid one of the disadvantages of naïve immediate mode GUI implementations, where the GUI sometimes has to be drawn twice per frame because a GUI event changed the look of the GUI. Here, we just discard the previous command buffer. Still, our triangles are passed to the GPU and rendered as soon as they are computed, with calls to glBufferData() and glDrawArrays().

Our shader code would be something along the lines of :

//----------------------------------------------------------------------
//vertex_shader.glsl
//----------------------------------------------------------------------
#version 410 core

in vec4 vertexPos;
in vec2 uvCoord;

out vec4 position;
out vec2 texCoord;

uniform mat4 transform = mat4(1.0);

void main()
{
	texCoord = uvCoord;
	gl_Position = transform*vertexPos;
}


//----------------------------------------------------------------------
//fragment_shader.glsl
//----------------------------------------------------------------------
#version 410 core

#define SHADER_MODE_COLOR	0
#define SHADER_MODE_TEXTURE	1
#define SHADER_MODE_FONT	2

in vec2 texCoord;
layout(location=0) out vec4 fragColor;

uniform vec4 globalColor;
uniform sampler2D texSampler;
uniform sampler2D fontSampler;

uniform int mode = SHADER_MODE_COLOR;

void main()
{
	if(mode == SHADER_MODE_COLOR)
	{
		fragColor = globalColor;
	}
	else if(mode == SHADER_MODE_TEXTURE)
	{
		fragColor = texture(texSampler, texCoord);
	}
	else if(mode == SHADER_MODE_FONT)
	{
		vec4 texColor = texture(fontSampler, texCoord);
		fragColor = globalColor;
		fragColor.a = fragColor.a*texColor.r;
	}
}


Measuring our renderer's performance

On top of our renderer sits an immediate mode GUI which handles the interactions of the user with Top !'s cuelist system. The maximum amount of work the GUI can push to the renderer depends on the cuelist size, because more cues means more text fields, buttons, matrix cells and so on to display. To measure our renderer's performance in a close-to-real-world scenario, we load a cuelist that fills the entire usable GUI space (ie. the cuelist and matrix views are full), and draw the whole GUI each frame.

We will see at the end that this is a worst case, because in most circumstances we only need to draw parts of the GUI, and in fact most of the time we don't need to redraw the GUI at all. Still, we need to ensure that if the GUI must be entirely redrawn for whatever reason (eg. if the window is resized, maximized, etc.), that task can be achieved in some reasonable amount of time so that the user won't notice a lag. We are not aiming for 60fps of a constantly changing world here. An occasional 50ms to draw a frame would be OK (of course, if we can do better, all the better).

The first iteration of our rendering code, while simple to reason about, gave extremely poor performance results. It achieved a frame rate of roughly 3 FPS (approximately 333ms/frame) while wasting a lot of CPU time. Using the profiler and the GPU driver monitor enabled us to identify the following bottlenecks :

  • 30% of the total time was spent with the CPU waiting for the GPU (this is indicated by the GPU driver monitor tool but is not considered by the profiler itself).
  • 32% (approximately 70ms) of the (remaining) time was spent inside glBufferData_Exec().
  • 20% (approximately 44ms) was spent inside glDrawArrays_ACC_GL3Exec().
  • 15% (approximately 33ms) was spent allocating and freeing memory inside drawing functions.

For clarity, let's first eliminate the last, minor bottleneck, ie. memory management, even if in a real scenario you would likely address it after having dealt with the first three.


Addressing memory allocations overhead

We already saw in Part 1 how the malloc()/free() operations were replaced inside the drawing functions by a linear allocation scheme, so this is just a reminder : at initialization, the graphics context allocates a "big" block of memory (1MB) that it uses to store the data structures needed by the path construction and triangulation functions. It maintains an offset pointing to the first "free" byte of the buffer, and simply returns the address of that byte and increments the offset whenever a memory block is requested through the allocation function :

void* GraphicsScratchAlloc(GraphicsContext* context, uint32 size)
{
	//NOTE(martin): align block size on 16 bytes, before the bounds check so
	//		the aligned size can't overflow the buffer
	size = (size + 15) & ~0x0f;

	if(context->scratchOffset + size > GRAPHICS_SCRATCH_BUFFER_SIZE)
	{
		return(0);
	}
	else
	{
		void* ptr = (void*)(context->scratchBuffer + context->scratchOffset);
		context->scratchOffset += size;
		return(ptr);
	}
}


The only subtlety here is that the blocks are all aligned on 16-byte boundaries to ease some SIMD optimizations on our vector type.
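As a quick sanity check on the rounding expression : adding 15 then clearing the low four bits rounds any size up to the next multiple of 16, and leaves exact multiples untouched.

```cpp
#include <cstdint>
#include <cassert>

// The round-up used by the scratch allocator. Since every block size is a
// multiple of 16, every block offset stays 16-byte aligned as well.
inline uint32_t AlignUp16(uint32_t size) {
	return (size + 15u) & ~0x0fu;
}
```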



Addressing vertex submission overhead


There are two kinds of overhead to consider regarding our OpenGL code :

  • Time which is spent by the CPU inside OpenGL API functions.
  • Time which is spent by the CPU waiting for the GPU to complete some operation.

Submitting vertex data in larger batches

To reduce the overhead due to OpenGL API calls, the idea is to gather larger chunks of vertex data and send them in batches, using only a few calls. However, each batch must use the same "state", ie. the same values for our shader uniforms. That means that each time we change the transformation matrix, the color, or the texture, we must call OpenGL to draw the data we gathered so far, then start collecting a new batch...

So we want to reduce the number of shader state changes, to allow us to send larger batches. In particular, we want to be able to use multiple textures inside the same batch, without having to modify texture units. That's where a texture atlas comes in handy : instead of creating a texture for each gradient, we create one big texture at initialization time, and allocate a rectangular area inside this texture each time we need to create a gradient. In our vertex attribute data, we adjust the texture UV coordinates so that they map to the correct area of the atlas.

We use a super simple (and somewhat wasteful) packing scheme for the texture atlas : we allocate regions side by side on horizontal lines, and jump to a new line each time we can't find enough space at the end of the current line. We reuse deallocated regions when we can, but don't try to prevent fragmentation. For now it's a reasonable assumption, in the context of a GUI, that most textures will be allocated once and kept in the atlas until the end of the program.
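The line-by-line ("shelf") part of the scheme can be sketched like this, with the free-list reuse omitted. Names are illustrative, not the engine's :

```cpp
#include <cassert>

struct AtlasRect { int x, y, w, h; };

// Allocate left to right on the current line; open a new line when the
// current one is full; fail when the atlas height is exhausted.
struct ShelfPacker {
	int atlasW, atlasH;
	int cursorX = 0, cursorY = 0, lineHeight = 0;

	bool Alloc(int w, int h, AtlasRect* out) {
		if(cursorX + w > atlasW) {                 // jump to a new line
			cursorX = 0;
			cursorY += lineHeight;
			lineHeight = 0;
		}
		if(cursorY + h > atlasH) return false;     // atlas is full
		*out = {cursorX, cursorY, w, h};
		cursorX += w;
		if(h > lineHeight) lineHeight = h;
		return true;
	}
};
```

The waste comes from lineHeight : a short region sharing a line with a tall one leaves unused space above it, which this scheme never reclaims.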

The texture atlas consists of the following members inside our graphics context structure :

	//(...)

	uint32 atlasTexture;
	Point2 atlasFreeOffset;
	float atlasMaxLineHeight;
	struct atlas_area
	{
		list_info list;
		graphics_rect rect;
	};
	list_info atlasFreeAreas;

	//(...)


The member atlasTexture is an OpenGL texture handle created by glGenTextures(). atlasFreeOffset is the next position at which we could try to allocate space for a new texture. atlasMaxLineHeight is the maximum height of all the areas on the current line. atlasFreeAreas is a list of areas that were previously marked as "deallocated" and can be reused for future textures.

Our gradient creation function, GraphicsGradientCreateLinear(), first tries to allocate a region from the texture atlas as follows :

	//(...)

	//NOTE(martin): find an empty area in the texture atlas
	bool found = false;
	for_each_in_list(&context->atlasFreeAreas, area, GraphicsGLContext::atlas_area, list)
	{
		if(width <= area->rect.w && height <= area->rect.h)
		{
			gradient->area = area->rect;
			found = true;
			ListRemove(&area->list);
			delete area;
			break;
		}
	}
	if(!found)
	{
		if(context->atlasFreeOffset.x + width >= GRAPHICS_TEXTURE_ATLAS_WIDTH)
		{
			context->atlasFreeOffset.x = 0;
			context->atlasFreeOffset.y += context->atlasMaxLineHeight;
			context->atlasMaxLineHeight = 0;
		}
		gradient->area = MakeRect(context->atlasFreeOffset.x, context->atlasFreeOffset.y, width, height);
		context->atlasFreeOffset.x += width;
		context->atlasMaxLineHeight = maximum(context->atlasMaxLineHeight, height);
	}

	//(...)


It then proceeds to draw a color gradient inside the allocated region :

	//(...)

	uint32 frameBuffer;
	glGenFramebuffers(1, &frameBuffer);
	glBindFramebuffer(GL_FRAMEBUFFER, frameBuffer);

	glBindTexture(GL_TEXTURE_2D, context->atlasTexture);

	glFramebufferTexture(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, context->atlasTexture, 0);
	GLenum drawBuffers[1] = {GL_COLOR_ATTACHMENT0};
	glDrawBuffers(1, drawBuffers);

	uint32 ret;
	if((ret = glCheckFramebufferStatus(GL_FRAMEBUFFER)) != GL_FRAMEBUFFER_COMPLETE)
	{
		return(ret);
	}

	glBindFramebuffer(GL_FRAMEBUFFER, frameBuffer);
	glViewport(gradient->area.x, gradient->area.y, width, height);

	Vector2 diag(width, height);
	Vector2 n = diag/diag.Norm();
	float d = Vector2(width, 0).Dot(n)/diag.Norm();

	Color bottomLeftColor = Color(d*stop.r + (1-d)*start.r,
				      d*stop.g + (1-d)*start.g,
				      d*stop.b + (1-d)*start.b,
				      d*stop.a + (1-d)*start.a);

	Color topRightColor = Color((1-d)*stop.r + d*start.r,
				      (1-d)*stop.g + d*start.g,
				      (1-d)*stop.b + d*start.b,
				      (1-d)*stop.a + d*start.a);

	uint32 packedStart = PackColor(start);
	uint32 packedStop = PackColor(stop);
	uint32 packedBottomLeft = PackColor(bottomLeftColor);
	uint32 packedTopRight = PackColor(topRightColor);

	vertex_attributes attr[] = {{Point3H(-1, 1, 0, 1), packedStart, 0, SHADER_MODE_COLOR},
				    {Point3H(1, 1, 0, 1), packedTopRight, 0, SHADER_MODE_COLOR},
				    {Point3H(1, -1, 0, 1), packedStop, 0, SHADER_MODE_COLOR},
				    {Point3H(-1, 1, 0, 1), packedStart, 0, SHADER_MODE_COLOR},
				    {Point3H(1, -1, 0, 1), packedStop, 0, SHADER_MODE_COLOR},
				    {Point3H(-1, -1, 0, 1), packedBottomLeft, 0, SHADER_MODE_COLOR}};

	glUniformMatrix4fv(context->transformLoc, 1, true, Matrix4::Id().m_data);

	GraphicsGLPushVertexData(context, 6, attr);
	GraphicsGLDrawTriangles(context);

	glBindFramebuffer(GL_FRAMEBUFFER, 0);

	float32 backingScaleX, backingScaleY;

	GraphicsNSGLBackendGetBackingScale(context->backend, &backingScaleX, &backingScaleY);
	glViewport(0, 0, context->windowWidth*backingScaleX, context->windowHeight*backingScaleY);

	glDeleteFramebuffers(1, &frameBuffer);


In this snippet, we first create a frame buffer attached to our texture atlas, and set the viewport to draw in the previously allocated area. Then we compute the colors for each corner of the area and draw a quad over the entire viewport. We finally destroy the frame buffer and reset the viewport to its original state.
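The corner-color math in the snippet deserves a closer look : for a gradient running along the quad's diagonal, each corner's color is a lerp of start and stop by the normalized projection of that corner onto the diagonal. Here is that computation in isolation (my own restatement of the snippet's d, with illustrative names) :

```cpp
#include <cassert>
#include <cmath>

struct Col { float r, g, b, a; };

inline Col Lerp(const Col& a, const Col& b, float t) {
	return { a.r + t*(b.r - a.r), a.g + t*(b.g - a.g),
	         a.b + t*(b.b - a.b), a.a + t*(b.a - a.a) };
}

// Fraction of the diagonal (w, h) covered by the projection of corner (cx, cy):
// dot(corner, diag) / |diag|^2. This matches the snippet's
// d = Vector2(width, 0).Dot(n) / diag.Norm() for the (w, 0) corner.
inline float DiagFraction(float cx, float cy, float w, float h) {
	float len2 = w*w + h*h;
	return (cx*w + cy*h) / len2;
}
```

Note that the fractions of opposite off-diagonal corners sum to 1, which is why the snippet can express topRightColor by swapping d and (1-d).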

Now let's talk about our vertex attribute format, which we just saw in use in the previous code snippet. In order to maximize the size of the batches we can send to the GPU without uniform or shader changes, we use a common vertex attribute format for all our primitives :

struct vertex_attributes
{
	Point3H position;
	uint32  packedColor;
	uint32	packedUV;
	uint8	mode;
	uint8	padding[3];
}__attribute__((packed, aligned(16)));


This structure contains the position of the vertex, its color, its UV coordinates in the texture atlas, and a shader mode. Note that we use 32-bit integers to pack our color and UV attributes. Once these attributes have been computed using 32-bit floating point precision for each component, our shader doesn't do much more computation on them. We can afford to lose some precision there, so we spare a lot of data transmission overhead and get a big speed-up.
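The packing itself can be sketched as follows. These helpers are illustrative (the engine's PackColor() may differ in channel order), but they match GL_UNSIGNED_BYTE / GL_UNSIGNED_SHORT attributes read back with normalization enabled, as the glVertexAttribPointer() calls shown later request :

```cpp
#include <cstdint>
#include <cassert>

// Four 8-bit channels in one uint32: r in the low byte, a in the high byte.
inline uint32_t PackColor8(float r, float g, float b, float a) {
	auto q = [](float v) -> uint32_t { return (uint32_t)(v * 255.0f + 0.5f); };
	return q(r) | (q(g) << 8) | (q(b) << 16) | (q(a) << 24);
}

// Two 16-bit normalized UVs in one uint32: u in the low half, v in the high half.
inline uint32_t PackUV16(float u, float v) {
	auto q = [](float x) -> uint32_t { return (uint32_t)(x * 65535.0f + 0.5f); };
	return q(u) | (q(v) << 16);
}
```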

The mode can be one of the following values :

  • SHADER_MODE_COLOR (0x01) : the fragment shader uses the (interpolated) color attribute to determine the color of the fragment.
  • SHADER_MODE_TEXTURE (0x01<<1) : the fragment shader uses the texture atlas and UV coordinates to determine the color of the fragment.
  • SHADER_MODE_FONT (0x01<<2) : the fragment shader uses a combination of the color, the UV coordinates and the font texture to determine the color of the fragment.
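These bit values are chosen so that each mode flag, masked out and shifted down, yields a 0-or-1 weight the shader can multiply by, with no branching. Here is the same arithmetic as a CPU sketch (one float channel stands in for a whole vec4) :

```cpp
#include <cassert>

enum { MODE_COLOR = 0x01, MODE_TEXTURE = 0x01 << 1, MODE_FONT = 0x01 << 2 };

// Each term's weight is 0 or 1 depending on the corresponding mode bit,
// so exactly one candidate value survives for a single-bit mode.
inline float Select(int mode, float shaded, float tex, float font) {
	return shaded * (mode & MODE_COLOR)
	     + tex    * ((mode & MODE_TEXTURE) >> 1)
	     + font   * ((mode & MODE_FONT) >> 2);
}
```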

The fragment shader code uses masking to select the appropriate operation :

	//(...)

	vec4 texColor = texture(texSampler, texCoord);
	vec4 fontAlpha = texture(fontSampler, texCoord);
	vec4 fontColor = shadedColor;
	fontColor.a = fontColor.a*fontAlpha.r;

	fragColor = shadedColor*(mode & SHADER_MODE_COLOR) + texColor*((mode & SHADER_MODE_TEXTURE)>>1) + fontColor*((mode & SHADER_MODE_FONT)>>2);

	//(...)


Now that we have a uniform way to send vertex data to the GPU while using different shading modes and textures, we can modify our drawing code to gather vertex data and send it in large batches to the GPU. To this effect we define two functions :

void GraphicsGLPushVertexData(GraphicsContext* context,
			      uint32 count,
			      vertex_attributes* data)
{
	if(context->vertexCount + count >= GRAPHICS_VERTEX_BUFFER_COUNT)
	{
		GraphicsGLDrawTriangles(context);
	}
	glBufferSubData(GL_ARRAY_BUFFER, context->vertexCount*sizeof(vertex_attributes), sizeof(vertex_attributes)*count, data);
	context->vertexCount += count;
}
void GraphicsGLDrawTriangles(GraphicsContext* context)
{
	glDrawArrays(GL_TRIANGLES, 0, context->vertexCount);
	glBufferData(GL_ARRAY_BUFFER, GRAPHICS_VERTEX_BUFFER_COUNT*sizeof(vertex_attributes), 0, GL_STREAM_DRAW);
	context->vertexCount = 0;
}


GraphicsGLPushVertexData() appends a buffer of vertex_attributes structures of length count to the vertex buffer. Each time we modify a uniform of our shaders, or when the temporary buffer is full, we call GraphicsGLDrawTriangles() to send the data to the GPU and draw the triangles, using glDrawArrays().
We then orphan the buffer using glBufferData(), indicating to OpenGL that we need a fresh buffer to write to. Otherwise, we would block inside GraphicsGLPushVertexData() when we call glBufferSubData(), until OpenGL has finished processing the vertex buffer.
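The batching policy itself (flush when the state changes or the buffer is about to overflow) can be modeled without any GL, which makes it easy to count how many draw calls a given command sequence costs. Names here are illustrative :

```cpp
#include <cstddef>
#include <cstdint>
#include <cassert>

// CPU model of the batching policy. Push() accumulates vertices; a state
// change (uniform/texture) or a full buffer forces a flush, i.e. one draw call.
struct Batcher {
	size_t capacity, count = 0;
	int flushCount = 0;
	uint32_t state = 0;

	explicit Batcher(size_t cap) : capacity(cap) {}

	void Flush() { if(count) { flushCount++; count = 0; } }

	void SetState(uint32_t s) {              // e.g. a uniform or texture change
		if(s != state) { Flush(); state = s; }
	}
	void Push(size_t vertexCount) {
		if(count + vertexCount >= capacity) Flush();
		count += vertexCount;
	}
};
```

This makes the cost model explicit : the fewer distinct states a frame touches, the fewer draw calls it takes, which is exactly what the texture atlas and the common attribute format buy us.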


Results of batch submission

So far we have managed to reduce the cost of OpenGL calls considerably, reaching a frame rate of approximately 17fps (59ms/frame).

  • There was still around 20% of the time wasted in waiting for the GPU to complete tasks.
  • 29% of the remaining time (ie 13.6ms) was spent inside glBufferData_Exec().
  • 0.7% (0.33ms) was spent inside glDrawArrays_ACC_GL3Exec().

The problem here is that our calls to glBufferData() and glBufferSubData() must still do a bunch of work copying our vertex data, and will eventually block to wait for the GPU. Not knowing the implementation details of the driver, it is hard to tell why it is blocking, but there are a few possible explanations I can only speculate about (if someone is a graphics driver specialist, feel free to chime in and tell me whether those are correct or completely wrong guesses ! There's not a lot of detailed info on that subject online and I would love to hear about it) :

  • First of all, when we orphan our buffer by calling glBufferData() and passing it a NULL pointer, the driver is free to give us a new buffer and allow us to write to it while the previous buffer is still in flight. But it is not mandatory at all. In some cases it could even just block until the drawing is finished, and return the same buffer. We don't know.
  • In some cases the driver might not have a free buffer right at hand, and need to allocate one for us, which may take some time.
  • Each time we update our buffer with glBufferSubData(), the driver might have to check that there are no drawing commands currently touching this buffer, which might incur an implicit lock.
  • Orphaning will likely increase the amount of work the driver has to do behind the scenes : it has to keep track of in-flight buffers, have some kind of "garbage collection" of orphaned buffers, etc. which could slow down the rest of its tasks and eventually stall.


Streaming data to the GPU

A second technique we can use to improve our vertex submission rate and avoid waiting for the GPU is buffer object streaming :

  • We pre-allocate a large vertex buffer and maintain a vertex counter to hold the number of vertices we pushed to the buffer. We also keep track of the start of the batch of vertices we are currently gathering.
  • When we push our vertex data, we map a subrange of the buffer into our process space, starting after the last vertex we wrote, and sized according to the data we want to write. We then write our data in the mapped memory, update our vertex counter and unmap the buffer.
    The important thing here is that we map the buffer using the GL_MAP_UNSYNCHRONIZED_BIT and GL_MAP_INVALIDATE_RANGE_BIT flags :
    • The first one ensures that the GL won't do any synchronization on pending operations on the buffer, which is what allows us to write to it without blocking while previous drawing commands are in flight.
    • The second one indicates that the previous content of that part of the buffer may be discarded, and that we don't need the mapped memory to be initialized, since we will update the entire range. The driver is thus free to give us a pointer to uninitialized memory without doing any data synchronization.
  • When we want to draw the content of the buffer, we set the vertex attribute pointers according to the offset of the current batch and issue a drawing command. Then we update the batch offset to be equal to the vertex count (thus pointing to the start of the next batch).
  • When there is no more room to copy our data, we simply orphan the buffer as before, and reset our vertex counter and batch offset to zero.

Our GraphicsGLPushVertexData() and GraphicsGLDrawTriangles() functions now look like this :

void GraphicsGLDrawTriangles(GraphicsContext* context)
{
	uint64 posOffset = context->batchOffset*sizeof(vertex_attributes);
	uint64 colorOffset = posOffset + sizeof(Point3H);
	uint64 uvOffset = colorOffset + sizeof(uint32);
	uint64 modeOffset = uvOffset + sizeof(uint32);

	glVertexAttribPointer(context->posLoc, 4, GL_FLOAT, GL_FALSE, sizeof(vertex_attributes), (GLvoid*)posOffset);
	glVertexAttribPointer(context->colorLoc, 4, GL_UNSIGNED_BYTE, GL_TRUE, sizeof(vertex_attributes), (GLvoid*)colorOffset);
	glVertexAttribPointer(context->uvLoc, 2, GL_UNSIGNED_SHORT, GL_TRUE, sizeof(vertex_attributes), (GLvoid*)uvOffset);
	glVertexAttribIPointer(context->modeLoc, 1, GL_UNSIGNED_BYTE, sizeof(vertex_attributes), (GLvoid*)modeOffset);

	glDrawArrays(GL_TRIANGLES, 0, context->vertexCount - context->batchOffset);
	context->batchOffset = context->vertexCount;
}

void GraphicsGLPushVertexData(GraphicsContext* context,
			      uint32 count,
			      vertex_attributes* data)
{
	if(context->vertexCount + count >= GRAPHICS_VERTEX_BUFFER_COUNT)
	{
		GraphicsGLDrawTriangles(context);
		glBufferData(GL_ARRAY_BUFFER, GRAPHICS_VERTEX_BUFFER_COUNT*sizeof(vertex_attributes), 0, GL_STREAM_DRAW);
		context->batchOffset = 0;
		context->vertexCount = 0;
	}

	uint32 dataLen = count*sizeof(vertex_attributes);

	void* buffer = glMapBufferRange(GL_ARRAY_BUFFER,
					context->vertexCount*sizeof(vertex_attributes),
					dataLen,
					GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT | GL_MAP_INVALIDATE_RANGE_BIT);
	memcpy(buffer, data, dataLen);
	glUnmapBuffer(GL_ARRAY_BUFFER);
	context->vertexCount += count;
}
      


There is a tradeoff to be found regarding the size of the vertex buffer. We want it to be big enough to fit several batches, but if it is too big the orphaning may be too slow, creating a performance degradation every so often.


Streaming results

Buffer object streaming completely got rid of the time the CPU spent waiting for the GPU, and also significantly reduced the time spent in OpenGL calls. The frame rate we achieve at this point is around 24fps (41ms/frame), and OpenGL now plays only a minor part in the total frame time :

  • 0.6% (0.25ms) is spent in glDrawArrays().
  • 0.6% (0.25ms) is spent in glUnmapBuffer().
  • 0.4% (0.16ms) is spent in glMapBufferRange().
  • the amount of time spent in glBufferData() is now negligible.

The major part of the rendering time is now spent in our stroking and filling algorithms, so that's where we could improve things further, and that will come up in the next months for sure !

An important thing to note though is that these results rely on our ability to send a large number of triangles to the GPU in one call. Each action that requires a change of shader uniforms or of the texture atlas forces us to send the current batch and start a new one. If that happens frequently, it kills all the benefits of this approach.

That's why I said earlier that I'm not happy with the way the clip area is implemented. Each clipping path push or pop needs to flush the current batch. This was a problem because the GUI massively relied on clipping to draw its widgets and views. I temporarily worked around it by doing more rounded rectangle fills and strokes, and by clipping text early with rectangles. But I'm still bothered by this problem. One option would be to keep a stack of clipping paths and clip the current drawing path against the current clipping path when we stroke or fill. But that would mean transforming both paths to the same coordinate system, and we would lose the benefit of doing coordinate transforms inside the shader...



Dirty rects, or avoiding all that work if possible


Most of the time, a GUI widget doesn't need to be redrawn each frame. In fact it requires a redraw only if its aspect changed due to user interaction or dynamic styling. While designed in an immediate mode style, the GUI system of Top! caches enough info to determine, at the widget level, if a redraw is needed. For instance, when a button is interacted with by the user, it will, in addition to issuing its drawing commands, emit a "dirty rectangle". Dirty rects are transformed into the window coordinate system and merged by the renderer to form a set of disjoint rectangles, which determines the areas of the window that must be redrawn.

Note that if the button is not interacted with, it still has to issue its drawing commands, because another widget could emit a dirty rectangle that overlaps its own area. To preserve a consistent (z-ordered) result, the untouched widget would have to be (partially) redrawn.

We expose a variant of GraphicsFlush() called GraphicsFlushForDirtyRect(). This function serves the same purpose as GraphicsFlush(), that is, running through the command buffer and executing our vector drawing functions. But it takes the dirty rectangles emitted during the last frame into account to skip or clip some commands.

Each drawing primitive (Stroke, Fill, ...) has an implicit bounding box corresponding to the extents of the current path (in some cases augmented by the maximum of the line width and miter limit). If the bounding box doesn't intersect any dirty rect, we can skip the primitive entirely. If it's contained in a dirty rect, we can proceed as usual. But what happens if the bounding box partially overlaps one (or more) dirty rectangles ? We must draw the primitive but clip the areas that fall outside of dirty rects.
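The per-primitive decision against a single dirty rect is a three-way classification, which can be sketched as follows (names are illustrative) :

```cpp
#include <cassert>

struct Rect { float x, y, w, h; };
enum Overlap { OUTSIDE, INSIDE, PARTIAL };

// OUTSIDE: skip the primitive. INSIDE: draw as usual.
// PARTIAL: draw with stencil clipping against the dirty region.
inline Overlap Classify(const Rect& box, const Rect& dirty) {
	bool disjoint = box.x + box.w <= dirty.x || dirty.x + dirty.w <= box.x ||
	                box.y + box.h <= dirty.y || dirty.y + dirty.h <= box.y;
	if(disjoint) return OUTSIDE;
	bool contained = box.x >= dirty.x && box.y >= dirty.y &&
	                 box.x + box.w <= dirty.x + dirty.w &&
	                 box.y + box.h <= dirty.y + dirty.h;
	return contained ? INSIDE : PARTIAL;
}
```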

Here we employ a technique similar to our path-defined clip area, using the stencil buffer :

  • We increment values in the whole stencil buffer.
  • We set our stencil test to always fail, our stencil operation to decrease the stencil value on failure, and we draw each dirty rectangle.
  • We loop through the command buffer, discarding commands that lie completely outside dirty areas, and executing the others as usual.
  • At the end we increase the stencil values in each dirty rect and decrease the whole stencil buffer to revert the stencil to its previous state.

This way, most of the time the rendering function is skipped entirely, and when the GUI needs to be redrawn, the redraw is often limited to one small widget bounding box. This allows us to achieve good display reactivity while consuming very little CPU time.


The End ('til next time)

This wraps up this series on Top!'s vector graphics renderer for now ! I may add posts if there are interesting changes or improvements, but the next topic will likely be the immediate mode GUI itself. As for Top!'s development, in the following months I plan to move on and consolidate and refine the cuelist system (which after all is the core of the damn thing !).

Thanks for reading !

Martin