Author: alek-tron.com | alektron1@gmail.com
There are countless questions online from people trying to learn OpenGL, asking why their textures show up upside down in the final render and what to do about it. The answer is always more or less the same and usually boils down to variations of the same advice: flip your texture, because OpenGL's origin is at the bottom.
Popular online resources teach the same thing. learnopengl, for example, states:
[...] OpenGL expects the 0.0 coordinate on the y-axis to be on the bottom side of the image, but images usually have 0.0 at the top of the y-axis
The official OpenGL spec clearly states that when uploading a texture buffer:
The first element corresponds to the lower left corner of the texture image
while the Direct3D documentation states:
The texel coordinate system has its origin at the top-left corner of the texture
Clearly OpenGL differs from Direct3D in that regard. Luckily, libraries for decoding image formats often have options to vertically flip the image data while loading it from disk.
Usually that is the end of it. You flip your texture before uploading it to the GPU or you flip the UV coordinates in the vertex shader. The texture shows up correctly and you move on.
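To put that advice in code: with stb_image, for example, the usual fix is a single call before loading. This is just a sketch - the helper function name is made up, and flipping the V coordinate in the shader is the other popular variant.
#define STB_IMAGE_IMPLEMENTATION
#include "stb_image.h"

// The commonly suggested fix: ask the decoder to flip the rows while loading,
// so that the first row in memory becomes the image's bottom row.
unsigned char* LoadTextureFlipped(const char* path, int* width, int* height)
{
    int channelsInFile = 0;
    stbi_set_flip_vertically_on_load(1);
    return stbi_load(path, width, height, &channelsInFile, 4); // request 4 channels (RGBA)
}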
I have a problem with this. While the advice may work at first, the explanation for why it does is not entirely correct and will lead to confusion when implementing a Direct3D backend later. Let's see what really happens and why you do not have to load your texture data differently for different graphics APIs.
As a first step, let's ignore the discussion of where exactly the texture origin is or isn't and see if we can write a sample application that renders the exact same textured rectangle once in OpenGL and once in Direct3D. If the online resources are correct, then we should see a clear difference. The full C++ code (including a 3D version) can be found on GitHub.
Let's define the vertices that make up our simple quad:
struct Vec2
{
    float x = 0;
    float y = 0;
};

struct Vertex2D
{
    Vec2 Position;
    Vec2 TexCoord;
};

// Corner positions (x to the right, y up)
const Vec2 UPPER_LEFT  = { -1, +1 };
const Vec2 UPPER_RIGHT = { +1, +1 };
const Vec2 LOWER_LEFT  = { -1, -1 };
const Vec2 LOWER_RIGHT = { +1, -1 };

// Two triangles forming a quad; texture coordinate (0, 0) sits at the upper left corner
Vertex2D vertices[] = {
    { LOWER_LEFT , { 0, 1 } },
    { LOWER_RIGHT, { 1, 1 } },
    { UPPER_RIGHT, { 1, 0 } },

    { UPPER_RIGHT, { 1, 0 } },
    { UPPER_LEFT , { 0, 0 } },
    { LOWER_LEFT , { 0, 1 } },
};
Note that we compose our quad of two triangles. This way we don't have to use an index buffer and can keep the code even simpler. Also note that the texture coordinate (0, 0) is in the upper left corner. I will not show the whole OpenGL/Direct3D code here; you can refer to the full code sample for that. Suffice it to say that there are no vertex transformations (except for a very simple view matrix) and no UV flipping taking place in either implementation.
In the shader we will then render the quad with the output color's red and green channels set to the interpolated UV coordinates. In GLSL this looks something like this:
#version 330 core

in vec2 i_TexCoord;
out vec4 OutColor;

void main()
{
    OutColor = vec4(i_TexCoord, 0, 1);
}
The HLSL version looks very similar. When rendering this quad, both rendering APIs produce the same result:
As expected, the upper left corner - where the UV coordinates are (0, 0) - is black, while the lower right corner is yellow (1, 1). We can now be sure that both graphics backends render each vertex, and therefore each UV coordinate, in the same screen space location.
Keep in mind that the vertex positions are somewhat arbitrarily chosen. One could say we are following D3D convention here by putting the origin in the upper left corner. However, the vertex positions are not really relevant to the problem at hand. We could just flip them without touching UVs or texture data. The result would then of course be flipped, but what matters is that it would still be the same for both APIs.
Now to load a texture, we will use the popular header-only library stb_image and the following texture created with Blender:
We then upload this texture to the GPU with glTexImage2D and CreateTexture2D respectively. Once again, we are not doing anything fancy to flip the image data. There is a flag in stb_image that can be set via stbi_set_flip_vertically_on_load, which we are NOT going to use. In both shaders we then directly feed the UV coordinates into the texture sampler. For OpenGL the main function now looks like this (a rough sketch of the loading and upload code follows after the shader):
void main()
{
    OutColor = vec4(texture(u_Texture, i_TexCoord).xyz, 1);
}
Rendering the quad with this setup produces the following result:
As we can see, both rendering APIs produce the same result despite us not flipping any data. Are we not violating OpenGL convention? Well, yes and no. Let's look into it.
To figure out what is going on, we have to understand how sampling a texture works under the hood.
Let's first take a look at how images are stored in memory. For simplicity's sake we will use this simple 4x4 pixel image as an example:
Images are (usually) two-dimensional; computer memory, however, is one-dimensional. It is just one big array of bytes. To store our image in memory we first have to answer the question of how to lay out our 2D pixels in 1D. Traditionally one starts with either the top-left or the bottom-left corner and lays out the pixels in rows. Starting at the top left, for example, we would end up with this:
While starting at the bottom left would end up like this:
There is one very subtle but important thing to note here. I used the terms 'top-left' and 'bottom-left' and I hope we can all agree that the top-left pixel is black, while the bottom-left pixel is green. When an image is presented to us as humans on a computer screen, we have no problem interpreting and communicating what is the 'top' and what is the 'bottom'.
As soon as the image is laid out one-dimensionally in memory, that property disappears. If I presented you with only one of the representations above, without any additional information, you would not be able to tell unambiguously which pixel belongs in the top left corner of the image. And neither can your computer.
From this point onward, the terms 'top-left' or 'bottom-left' lose their meaning until we present the image on the screen again. For everything in between we have to rely on convention.
The stb_image library documentation states that images are loaded in such a way that the pixel buffer starts with the top-left pixel and then continues row by row from left to right, top to bottom. Different image file formats may or may not use different conventions to store the data on disk but fortunately for us, stb_image already takes care of that and guarantees us a certain memory layout regardless of file format.
This perfectly fits our first memory layout so for the rest of the example this is the one we will use.
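In code, this convention simply means a row-major offset from the start of the buffer. A small illustration, assuming four channels (RGBA) per pixel:
// Address of the pixel at column x, row y in the buffer returned by stb_image.
// y counts rows from the top, because that is the layout stb_image guarantees.
unsigned char* PixelAt(unsigned char* pixels, int width, int x, int y)
{
    return pixels + (y * width + x) * 4; // 4 channels per pixel
}

// For our 4x4 example:
// PixelAt(pixels, 4, 0, 0) -> the black pixel ('top-left')
// PixelAt(pixels, 4, 0, 3) -> the green pixel ('bottom-left')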
This pixel array now gets sent to the GPU. If we stick to the conventions and visualize the data accordingly, it would look like this:
Since we did not flip our pixel data, the image shows up flipped in the OpenGL coordinate system. This is what everyone is talking about when they say that OpenGL's origin is in the lower left corner. By not flipping the image data we technically went against the OpenGL convention. We will see why that is not as much of a problem as you might think in a second.
There is even an argument to be made that the coordinate systems are not all that different when it comes to sampling. D3D's texture space might grow downwards, but the coordinates are still positive, so you could essentially just mentally flip the whole thing and end up with the same picture as OpenGL.
Keep in mind that this mental 'trick' only works for textures uploaded from the CPU. Later we will take a look at framebuffer textures and see why they are a bit of a different story.
Despite our little mental exercise, we are not actually going to flip any coordinates, but when looking at the sampling process it becomes obvious why that is not even necessary. The black pixel (our initial 'top-left') is still at (0, 0) in both coordinate systems. The same goes for the yellow pixel at (1, 1). So both APIs will sample the texture the same way.
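To make that concrete, here is roughly what a nearest-neighbor lookup boils down to. This is a simplification (real samplers handle filtering, wrap modes and texel centers), but it shows that only the UV coordinate and the buffer layout matter; the vertices' screen positions never appear.
#include <algorithm>

// Conceptual nearest-neighbor sample from the uploaded pixel buffer (4 channels per pixel).
// (0, 0) maps to the first pixel in the buffer - our black 'top-left' - and (1, 1) maps to
// the last one - the yellow pixel - in OpenGL and D3D alike.
unsigned char* SampleNearest(unsigned char* pixels, int width, int height, float u, float v)
{
    int x = std::min((int)(u * width),  width  - 1);
    int y = std::min((int)(v * height), height - 1);
    return pixels + (y * width + x) * 4;
}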
Again, keep in mind that we might as well have chosen our UVs differently, say with (0, 0) being in the lower left corner on screen. That is what learnopengl does in their example. Of course our final result would then be flipped. When only writing an OpenGL backend you might then be inclined to believe that you have to flip the textures.
But we can now see that the exact same thing would happen for D3D. The texture sampling is not dependent on the vertices' screen space positions. In fact, in a real application (especially in 3D), you usually apply various transformations to your vertices before the texture gets sampled. But an upside-down camera/view matrix or a model transform does not influence texture space.
Yes, we are violating OpenGL convention by not flipping the texture, but what are the alternatives? Let's say we do flip the texture for OpenGL only, as everyone says we should. We now get different results for our graphics backends. To fix it we have to either flip the texture for D3D as well or flip the UVs. What is that? Are we now going against the D3D conventions? Exactly!
If we want to write graphics API agnostic code, we have to violate some conventions. By flipping the texture you are deciding to add extra logic to your OpenGL backend to follow its convention, but then you also have to add extra logic to D3D to make it fit. I decide to do nothing instead, and it all falls into place.
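For the record, the 'extra logic' in question usually looks something like this on the CPU side: a row flip applied to the pixel buffer before upload (a sketch, assuming 4 channels per pixel).
#include <cstring>
#include <vector>

// Swap the pixel rows in place so the last row comes first. This is the extra step
// you would bolt onto one backend (or the other) just to satisfy its convention.
void FlipRowsInPlace(unsigned char* pixels, int width, int height)
{
    const int rowBytes = width * 4;
    std::vector<unsigned char> tmp(rowBytes);
    for (int y = 0; y < height / 2; y++)
    {
        unsigned char* top    = pixels + y * rowBytes;
        unsigned char* bottom = pixels + (height - 1 - y) * rowBytes;
        std::memcpy(tmp.data(), top,    rowBytes);
        std::memcpy(top,        bottom, rowBytes);
        std::memcpy(bottom, tmp.data(), rowBytes);
    }
}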
I have recently added a 3D version to my sample code to show that everything I have talked about applies to 3D just the same.
The only difference between the two backends in that sample is the projection matrix (the view matrix is the same for both), which is indeed slightly different for OpenGL vs D3D.
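To illustrate what that difference is (a sketch, not the sample's actual code, and it assumes GLM 0.9.9 or newer): the best-known distinction is the clip-space depth range, which is [-1, 1] in OpenGL and [0, 1] in Direct3D.
#include <glm/glm.hpp>
#include <glm/gtc/matrix_transform.hpp>   // glm::perspective (depth mapped to [-1, 1])
#include <glm/ext/matrix_clip_space.hpp>  // glm::perspectiveZO (depth mapped to [0, 1])

// Two projection matrices with identical parameters; only the depth range differs.
glm::mat4 ProjectionGL()
{
    return glm::perspective(glm::radians(60.0f), 16.0f / 9.0f, 0.1f, 100.0f);
}

glm::mat4 ProjectionD3D()
{
    return glm::perspectiveZO(glm::radians(60.0f), 16.0f / 9.0f, 0.1f, 100.0f);
}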
That does however in no way influence texture sampling.
All this being said, the differences in coordinate system conventions do make a difference in other areas.
One example is a texture that was not loaded from a file but instead created by rendering to it directly.
Imagine we rendered our 4x4 pixel square not to the default screen buffer but to an offscreen texture that we later want to sample from.
With the vertex carrying UV (0, 0) in the upper left corner again, we would end up with the following textures in GPU memory:
Of course OpenGL and D3D both follow their own conventions correctly and render the final image in their respective coordinate systems. If we now naively sample texel (0, 0) again (either from a shader or e.g. via `glReadPixels`), we will indeed get different results. A green pixel with OpenGL and the black pixel with D3D.
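To illustrate, reading back that 'first' texel with glReadPixels looks like this. A sketch: the D3D equivalent (copying to a staging texture and mapping it) is more verbose and not shown here.
#include <glad/glad.h>   // or whichever OpenGL loader is in use

// Read back the texel at (0, 0) of the currently bound framebuffer.
// OpenGL's framebuffer origin is the lower-left corner, so for our example this
// returns the green pixel; the equivalent D3D readback would return the black one.
void ReadFirstTexel(unsigned char outRgba[4])
{
    glReadPixels(0, 0, 1, 1, GL_RGBA, GL_UNSIGNED_BYTE, outRgba);
}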
There is yet another variable to keep in mind that we have not talked about: 3D Modeling tools. Meshes are usually not created manually like the quad in our example.
Whether you use Blender, 3DS Max or any other tool, they usually come with a UV editor. And of course this editor once again has its own coordinate system, which may or may not be standardised (Blender and 3DS Max both have the (0, 0) coordinate in the lower left corner).
Depending on your (or your team's) workflows and choice of tools, you might again need options to flip UVs and textures occasionally. But not because of your choice of graphics API.
While writing this, I also stumbled upon this article by Stewart Lynch, which basically describes the same thing. It does not go into as much detail but is also worth a read.