Can this be optimised any further?


Post by Zuffy » Sat Apr 22, 2017 3:44 am

UPDATE: see this post: viewtopic.php?f=30&t=257&p=1181#p1181

With a little help from some community members and information here & here I've written a GIMP plugin (in Python) which combines two normal maps in a mathematically "correct" fashion. The only issue I have with it is that it's a little slow: combining two 1024x1024 normal maps takes around 8.5 seconds on my PC (i5 760 at stock clocks), and combining two 4096x4096 maps takes ~136 seconds.

The main loop is listed below:
Note: clamp() is a function I found online.

Code: Select all

for n in xrange(num_pixels):
    if(n % Update == 0):
        gimp.progress_update(float(n) / float(num_pixels))
    #Indices for slicing
    start = n * pixel_size
    end = start + pixel_size

    basePixel = basePixels[start:end]
    detailPixel = detailPixels[start:end]
    destPixel = destPixels[start:end]

    #Remap base from 0..255 bytes: x,y to -1..1, z to 0..2
    basePixel[0] = ((basePixel[0])-127.5)/127.5
    basePixel[1] = ((basePixel[1])-127.5)/127.5
    basePixel[2] = (basePixel[2])/127.5

    #Remap detail: x,y to -1..1 with the sign flipped, z to -1..1
    detailPixel[0] = (127.5-(detailPixel[0]))/127.5
    detailPixel[1] = (127.5-(detailPixel[1]))/127.5
    detailPixel[2] = ((detailPixel[2])-127.5)/127.5

    #dot(base, detail) divided by base z
    partialResult = ((basePixel[0]*detailPixel[0])+(basePixel[1]*detailPixel[1])+(basePixel[2]*detailPixel[2]))/basePixel[2]

    #Combined normal, rescaled back to 0..255 and clamped
    destPixel[0] = int(clamp(127.5*(((basePixel[0]*partialResult)-detailPixel[0])+1),0,255))
    destPixel[1] = int(clamp(127.5*(((basePixel[1]*partialResult)-detailPixel[1])+1),0,255))
    destPixel[2] = int(clamp(127.5*(((basePixel[2]*partialResult)-detailPixel[2])+1),0,255))
    destPixel[3] = 255

    destPixels[start:end] = destPixel
EDIT: I just spotted a minor optimisation which has shaved off ~0.5s for the 1024x1024 case, giving a runtime of 8s. Code snippet updated.
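
For reference, clamp() just limits a value to a range; it's roughly equivalent to this (not the exact code I grabbed, but the same idea):

Code: Select all

def clamp(value, lo, hi):
    # Limit value to the inclusive range [lo, hi]
    return max(lo, min(value, hi))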


Post by Zuffy » Mon Apr 24, 2017 6:55 am

More improvement:

Code: Select all

for n in xrange(num_pixels):
    if(n % Update == 0):
        gimp.progress_update(float(n) / float(num_pixels))
    #Indices for slicing
    start = n * pixel_size
    end = start + pixel_size

    basePixel = basePixels[start:end]
    detailPixel = detailPixels[start:end]
    destPixel = destPixels[start:end]

    basePixel[0] = (basePixel[0]-127.5)/127.5
    basePixel[1] = (basePixel[1]-127.5)/127.5
    basePixel[2] = basePixel[2]/127.5

    detailPixel[0] = (127.5-detailPixel[0])/127.5
    detailPixel[1] = (127.5-detailPixel[1])/127.5
    detailPixel[2] = (detailPixel[2]-127.5)/127.5

    partialResult = ((basePixel[0]*detailPixel[0])+(basePixel[1]*detailPixel[1])+(basePixel[2]*detailPixel[2]))/basePixel[2]

    subPixel0 = 127.5*(((basePixel[0]*partialResult)-detailPixel[0])+1)
    subPixel1 = 127.5*(((basePixel[1]*partialResult)-detailPixel[1])+1)
    subPixel2 = 127.5*(((basePixel[2]*partialResult)-detailPixel[2])+1)

    destPixel[0] = minVal if subPixel0 < minVal else maxVal if subPixel0 > maxVal else int(subPixel0)
    destPixel[1] = minVal if subPixel1 < minVal else maxVal if subPixel1 > maxVal else int(subPixel1)
    destPixel[2] = minVal if subPixel2 < minVal else maxVal if subPixel2 > maxVal else int(subPixel2)
    destPixel[3] = 255

    destPixels[start:end] = destPixel


Post by rocketsoup » Mon Apr 24, 2017 8:57 am

I can't step through it all, but as a general tip I like to work in integers as much as possible. Integers are fast, fast, fast. Floating point is slow, slow, slow. By way of analogy with a similar pixel operation, if I were doing a Photoshop "multiply" operation, the straightforward but suboptimal approach would be (in pseudocode):

Code: Select all

destRed = round((source1Red / 255.0) * (source2Red / 255.0) * 255.0)
But I can do the same thing with fewer mult/div operations and no floating point operations with:

Code: Select all

destRed = source1Red * source2Red / 255
(I hope I did that right.)

For your code it looks like you're working with floats in the range of -1.0 to +1.0. I wonder if you could operate on integers in the range of -128 to 127 instead, minimizing conversions to floating point as much as possible, usually by delaying division until the very end. It's harder to write because your intermediate values aren't as intuitive, but it's less work for the processor.
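
To make that concrete in actual Python (untested, and the +127 is just the usual trick to make a truncating integer division round to nearest rather than always down, not something your plugin needs verbatim):

Code: Select all

src1_red, src2_red = 200, 180   # example 8-bit channel values

# float version: normalise to 0..1, multiply, scale back up
dest_float = int(round((src1_red / 255.0) * (src2_red / 255.0) * 255.0))

# integer version: one multiply, one add, one integer divide, no floats
dest_int = (src1_red * src2_red + 127) // 255

print dest_float, dest_int   # both give 141 here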


Post by Zuffy » Mon Apr 24, 2017 11:25 am

I had considered using scaled integers rather than floats; I just need more motivation to do so. :)

And yes, the values start off as 0 to 255, read from the image layers; these are then converted to the range 0 to 1, which is then converted to -1 to 1. At least, it was like that initially, but looking at it now, the various optimisations I've made skip the intermediate step.

Also, the "Pixels" arrays are 1-dimensional, every four elements is a pixel (RGBA).

Reducing the number of divisions would be nice, so I'll start there I think.


Post by rocketsoup » Mon Apr 24, 2017 12:02 pm

The other thing is that this looks like it could be written as three simple formulas. I would suggest writing them out on paper and doing some good old-fashioned algebraic simplification. It'll be much more apparent which operations can be canceled out.


Post by Zuffy » Tue Apr 25, 2017 6:32 am

This doesn't work correctly; either I've made a mistake somewhere or it suffers from catastrophic loss of precision:

Code: Select all

            basePixel[0] = (basePixel[0]-127.5)
            basePixel[1] = (basePixel[1]-127.5)
            basePixel[2] = basePixel[2]

            detailPixel[0] = (127.5-detailPixel[0])
            detailPixel[1] = (127.5-detailPixel[1])
            detailPixel[2] = (detailPixel[2]-127.5)

            partialResult = ((basePixel[0]*detailPixel[0])+(basePixel[1]*detailPixel[1])+(basePixel[2]*detailPixel[2]))/(basePixel[2]*16256.25)
           
            subPixel0 = ((basePixel[0]*partialResult) - detailPixel[0]) + 127.5
            subPixel1 = ((basePixel[1]*partialResult) - detailPixel[1]) + 127.5
            subPixel2 = ((basePixel[2]*partialResult) - detailPixel[2]) + 127.5
My thinking was: remove the division by 127.5 from the first six statements, then divide each subsequent occurrence of basePixel[n] and detailPixel[n] by 127.5, then simplify.

For example, partialResult becomes:

Code: Select all

(((basePixel[0]/127.5)*(detailPixel[0]/127.5))+((basePixel[1]/127.5)*(detailPixel[1]/127.5))+((basePixel[2]/127.5)*(detailPixel[2]/127.5)))/basePixel[2]
Each product simplifies to:

Code: Select all

(basePixel[n]*detailPixel[n])/127.5^2
Then the divisions by 127.5^2 can be factored out, giving the result in the first code block.

And the subPixels become:

Code: Select all

subPixeln = 127.5*((((basePixel[n]/127.5)*partialResult)-(detailPixel[n]/127.5))+1)
Which simplifies to the result in the first code block.


Post by rocketsoup » Tue Apr 25, 2017 7:56 am

I looked at your original code, since that works. Here's what I came up with. No floating points. This is pseudocode as I won't embarrass myself trying to ape functioning Python. :) Note: I'm using prime ' ticks on bp0-2 and dp0-2 to differentiate them from their raw input values. I avoid assigning new values to input variables, as it gets confusing sometimes.

Also note I'm using an offset of 128, not 127.5. Signed bytes use a range of [-128..+127], which is likely the convention your source is using. Using 127.5 makes all your values floating point, which slows computation, can't represent a true 0 value, and probably doesn't match what your source is doing. If you're worried about that weird asymmetrical -128, your clamp function at the end will keep it from doing weird stuff. Since we're dealing exclusively with ints now there's no need to call int() on your end result anymore.
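
To illustrate the difference between the two offsets (just a few sample values):

Code: Select all

raw byte    raw - 128    (raw - 127.5)/127.5
       0        -128        -1.0
     127          -1        -0.0039...   <- no raw value maps to exactly 0.0
     128           0        +0.0039...
     255         127        +1.0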

Code: Select all

# Base pixel
bp0' = bp0 - 128    # In the range of [-128, 127]
bp1' = bp1 - 128    # [-128, 127]
bp2' = bp2 * 2      # [0, 510] (original code has no bias, so its range is [0, 2). Intentional?)

# Detail pixel
dp0' = 127 - dp0    # [-128, 127] (the 127 offset is intentional to maintain the expected result range)
dp1' = 127 - dp1    # [-128, 127]
dp2' = dp2 - 128    # [-128, 127]

# Partial result
p = (bp0' * dp0' + bp1' * dp1' + bp2 * dp2') // (bp2' * 127)    # [-128, 127]

# Result
r0 = clamp((bp0' * p // 127 - dp0') + 127, 0, 255)    # [0, 255]
r1 = clamp((bp1' * p // 127 - dp1') + 127, 0, 255)    # [0, 255]
r2 = clamp((bp2' * p // 127 - dp2') + 127, 0, 255)    # [0, 255]
(Disclaimer: I didn't test this. Ha ha. Just moved code around by eyeballing it.)

EDIT: Discovered Python does integer division with the // operator, so I put that in. Integer division gets you 11 // 2 = 5 (floored) rather than 5.5, and we don't want to introduce floating point. Also changed bp2' as per my later comment.
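
One wrinkle worth knowing: // floors towards negative infinity rather than truncating towards zero, and some of these intermediate values do go negative:

Code: Select all

>>> 11 // 2
5
>>> -11 // 2
-6        # floored, not the -5 that C-style truncation would give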


Post by rocketsoup » Tue Apr 25, 2017 7:59 am

My bp2' may be wrong. In your original code you're dividing by 127.5 without a bias, meaning your variable is in the range of [0, 2). So actually my bp2' should probably be multiplied by 2 to match that accurately.


Post by Zuffy » Tue Apr 25, 2017 1:53 pm

The source I'm basing this on is written in HLSL, and it begins with a call to a function called tex2D() which returns values in the range 0 to 1. The reason I used 127.5 is that it's what falls out of simplifying the original statements. As far as I know, HLSL shader code operates entirely in floating point, so integer byte values are never seen. Aaannnnd, I don't know why the original bp2 has no bias, beyond that being what the maths dictates.
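
Spelled out, that's where the 127.5 (and the missing bias on z) comes from:

Code: Select all

tex2D(...)  = raw/255                                    # 0..1
t.x, t.y    = raw/255 * 2 - 1  =  (raw - 127.5)/127.5    # -1..1
t.z         = raw/255 * 2      =   raw/127.5             # 0..2 (no bias)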


Anyway, I've tested your code and the good news is it's faster; the bad news is that, like mine, it gives the wrong result. :( Oddly, it appears at a glance to give the same wrong result?? :feliciaGranny:


Post by rocketsoup » Tue Apr 25, 2017 2:10 pm

Well, at least you can be disappointed in a more timely manner! :lol:


Post by Zuffy » Tue Apr 25, 2017 2:15 pm

:lol:

I just double checked and both lots of code do indeed give the same wrong result... :?

EDIT: This is the HLSL:

Code: Select all

float3 t = tex2D(texBase,   uv).xyz*float3( 2,  2, 2) + float3(-1, -1,  0);
float3 u = tex2D(texDetail, uv).xyz*float3(-2, -2, 2) + float3( 1,  1, -1);
float3 r = t*(dot(t, u)/t.z) - u;
return r*0.5 + 0.5;
t is the base normal map, u is the detail normal map, r is the result.
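
In plain Python terms (working on one pixel in 0-1 floats, exactly as the shader does), that maths comes out as roughly:

Code: Select all

# b and d are (r, g, b) tuples for one pixel of the base and detail maps, already scaled to 0..1
b = (0.5, 0.5, 1.0)     # example: a flat base normal
d = (0.25, 0.75, 0.9)   # example detail pixel

t = (b[0]*2 - 1, b[1]*2 - 1, b[2]*2)
u = (1 - d[0]*2, 1 - d[1]*2, d[2]*2 - 1)
dot_tu = t[0]*u[0] + t[1]*u[1] + t[2]*u[2]
r = tuple(t[i]*(dot_tu/t[2]) - u[i] for i in range(3))
result = tuple(c*0.5 + 0.5 for c in r)   # back into 0..1; a flat base just returns the detail pixel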

Looking at all this again from the start, :feliciaChew:


Post by Zuffy » Wed Apr 26, 2017 9:09 am

This code works! :D

Code: Select all

            basePixel[0] = (basePixel[0]-127.5)
            basePixel[1] = (basePixel[1]-127.5)
            basePixel[2] = (basePixel[2])

            detailPixel[0] = (127.5-detailPixel[0])
            detailPixel[1] = (127.5-detailPixel[1])
            detailPixel[2] = (detailPixel[2]-127.5)

            partialResult = (((basePixel[0]*detailPixel[0])+(basePixel[1]*detailPixel[1])+(basePixel[2]*detailPixel[2]))/basePixel[2])/127.5
            
            subPixel0 = ((basePixel[0]*partialResult)-detailPixel[0])+127.5
            subPixel1 = ((basePixel[1]*partialResult)-detailPixel[1])+127.5
            subPixel2 = ((basePixel[2]*partialResult)-detailPixel[2])+127.5
It processes two 1024x1024 normal maps in approx. 7 seconds, which is a ~0.3s improvement over the previous fastest version. For some reason the double division seems to be slightly quicker than one division and one multiply for calculating 'partialResult'. I think all that's left to do now is convert it to use integers only.

Just for reference, the very first version I wrote took 11 seconds to process the same input.

EDIT: Added bonus: this version appears to give a better output!


Post by Zuffy » Wed Apr 26, 2017 2:33 pm

Upon further testing, checking etc. I've realised that the scaling isn't necessary to retain precision; as long as the dot product is sufficiently precise it all works correctly. That means the final subpixel equations must each keep their own division, since dividing the partial result once up front (to reduce the number of divisions from three to one) loses too much precision and corrupts the results.
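
A quick illustration of why the order matters once everything is an integer (made-up numbers, just to show the effect):

Code: Select all

>>> dot, bP2, bP0 = 1000, 100, 90
>>> (bP0 * dot) // (bP2 * 128)    # divide last: the quotient survives
7
>>> bP0 * (dot // (bP2 * 128))    # divide first: the partial result rounds to 0
0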
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Okay, here's what might possibly be the final version (for now)

Code: Select all

#!/usr/bin/env python
#Combines two normal maps, (base + detail) using reoriented normal mapping technique
#described here: http://blog.selfshadow.com/publications/blending-in-detail/
#Based on this particular HLSL code snippet:
#float3 t = tex2D(texBase,   uv).xyz*float3( 2,  2, 2) + float3(-1, -1,  0);
#float3 u = tex2D(texDetail, uv).xyz*float3(-2, -2, 2) + float3( 1,  1, -1);
#float3 r = t*(dot(t, u)/t.z) - u;
#return r*0.5 + 0.5;
#
#Note: tex2D scales from 0-255 to 0-1 for RGBA textures.

#Also makes use of the technique found here: http://shallowsky.com/blog/gimp/pygimp-pixel-ops.html
#This allows fast writing of result pixels.

#To do:
#      Add normalisation option.

import time
from array import array
from gimpfu import *

def combine_normals(image,selectedLayer,baseLayer,detailLayer):
   
    starttime = time.time()

    #Check that both layers are the same size as the image
    if((not(image.width == baseLayer.width == detailLayer.width) or not(image.height == baseLayer.height == detailLayer.height))):
        gimp.message("ERROR: Image & Layer dimensions do not match.")
        return

    # Indicates that the process has started.
    gimp.progress_init("Combining Layers...")

    # Set up an undo group.
    pdb.gimp_image_undo_group_start(image)
    
    # Create a new layer
    newLayer = gimp.Layer(image, "Result", image.width, image.height, RGBA_IMAGE, 100, NORMAL_MODE)
    image.add_layer(newLayer, 0)
    
    # Clear the new layer.
    pdb.gimp_edit_clear(newLayer)
    newLayer.flush()
    
    # Combine the normal maps
    try:
        srcHeight = image.height
        srcWidth = image.width
        
        # Get the pixel regions, one per layer
        baseRegion = baseLayer.get_pixel_rgn(0, 0, srcWidth, srcHeight, False, False)
        detailRegion = detailLayer.get_pixel_rgn(0, 0, srcWidth, srcHeight, False, False)
        destRegion = newLayer.get_pixel_rgn(0, 0, srcWidth, srcHeight, True, True)

        pixel_size = len(baseRegion[0,0])
        #source pixels
        basePixels = array("B", baseRegion[0:srcWidth, 0:srcHeight])
        detailPixels = array("B", detailRegion[0:srcWidth, 0:srcHeight])
        #destination pixels
        destPixels = array("B", "\x00" * (srcWidth * srcHeight * pixel_size))

        #Each pixel is (pixel_size) elements of the array(s).
        #Indexing slice: from n*pixel_size to (n*pixel_size)+pixel_size
        #This is the same for all three regions since we are combining pixels, not moving them.        
        num_pixels = (srcHeight * srcWidth)
        Update = num_pixels // 10
   
        for n in xrange(num_pixels):
            if(n % Update == 0):
                gimp.progress_update(float(n) / num_pixels)
            #Indices for slicing
            start = (n * pixel_size)
            end = start + pixel_size

            basePixel = basePixels[start:end]
            detailPixel = detailPixels[start:end]
            destPixel = destPixels[start:end]            
            #base 'pixel' components
            bP0 = (basePixel[0]-128)
            bP1 = (basePixel[1]-128)
            bP2 = (basePixel[2])
            #detail 'pixel' components
            dP0 = (128-detailPixel[0])
            dP1 = (128-detailPixel[1])
            dP2 = (detailPixel[2]-128)
            
            dotProduct = (bP0*dP0)+(bP1*dP1)+(bP2*dP2)
            
            subPixel0 = (((bP0*dotProduct)//(bP2*128))-dP0)+128
            subPixel1 = (((bP1*dotProduct)//(bP2*128))-dP1)+128
            #subPixel2 = (((bP2*dotProduct)//(bP2*128))-dP2)+128
            subPixel2 = ((dotProduct//128)-dP2)+128
            
            destPixel[0] = 0 if subPixel0 < 0 else 255 if subPixel0 > 255 else subPixel0
            destPixel[1] = 0 if subPixel1 < 0 else 255 if subPixel1 > 255 else subPixel1
            destPixel[2] = 0 if subPixel2 < 0 else 255 if subPixel2 > 255 else subPixel2
            destPixel[3] = 255
            
            destPixels[start:end] = destPixel

        destRegion[0:srcWidth, 0:srcHeight] = destPixels.tostring() 

        newLayer.name = str("Result (%.3f)" % (time.time()-starttime))
        #Update the new layer.
        newLayer.flush()
        newLayer.merge_shadow(True)
        newLayer.update(0, 0, newLayer.width, newLayer.height)
        
    except Exception as err:
        #clean up
        image.remove_layer(newLayer)
        raise 
    finally:
        # Close the undo group.
        pdb.gimp_image_undo_group_end(image)
    
        # End progress.
        pdb.gimp_progress_end()
    
register(
    "python_fu_combine_normals",
    "Combine Two Normal Maps",
    "Correctly combine two tangent-space normal maps using 'Reoriented Normal Mapping'",
    "SZ",
    "SZ",
    "2017",
    "Combine Normal Maps",
    "RGBA",
    [
        (PF_IMAGE, "image", "Image", None),
        (PF_LAYER, "layer1", "selected Layer", None),
        (PF_LAYER, "layer2", "Base Layer", None),
        (PF_LAYER, "layer3", "Detail Layer", None),
    ],
    [],
    combine_normals, menu="<Image>/Filters/Combine")

main()
This runs the 1024x1024 test case in ~3.65s :)


Post by Zuffy » Thu Apr 27, 2017 5:51 pm

I decided to poke around with the Python threading module, and to my surprise two threads are slower than one!? When I checked CPU usage with a single thread it was 30% (which is odd, since one thread should max out at 25% on this quad-core); when I checked with two threads it was... 30% ???

There's something weird about the way GIMP runs Python scripts: they're supposed to run in a separate process, but that would mean they're unrestricted in terms of CPU use, which appears not to be the case. Hmm.

EDIT: Yet more poking has revealed that it's pythonw.exe itself (the one used by GIMP) that only uses 30% CPU regardless of how many threads are created...

And further reading reveals that it's a limitation of "standard" Python (the global interpreter lock). *facepalm* :feliciaRage: :feliciaSoAngry:


Post by Molokov » Mon May 01, 2017 3:39 am

Yeah, Python (2.x) doesn't really implement threading properly; Python threads are really pseudothreads, and you have to be very careful when using them. So it's good for parallelising functionality, but not for optimising algorithms - the overheads are high, and things tend to execute sequentially on a single core.


Post by Zuffy » Thu May 04, 2017 9:32 am

Python does apparently support multiprocessing (separate processes instead of threads), which would neatly sidestep the global interpreter lock issue; if I find the motivation I might have a look at that. :)
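
If I do, the rough shape would be something like this untested sketch - it splits the flat pixel strings into one band per row and hands the bands to a small pool of worker processes (function names are just placeholders, and I'd still need to check how the multiprocessing module behaves inside GIMP's Python-Fu, especially on Windows):

Code: Select all

from multiprocessing import Pool

def combine_band(args):
    # Combine one row's worth of base/detail bytes; runs in a worker process.
    base_band, detail_band, pixel_size = args
    out = bytearray(len(base_band))
    for start in xrange(0, len(base_band), pixel_size):
        bP0 = ord(base_band[start]) - 128
        bP1 = ord(base_band[start + 1]) - 128
        bP2 = ord(base_band[start + 2])      # assumes z is never 0, same as the loop above
        dP0 = 128 - ord(detail_band[start])
        dP1 = 128 - ord(detail_band[start + 1])
        dP2 = ord(detail_band[start + 2]) - 128

        dot = bP0*dP0 + bP1*dP1 + bP2*dP2
        sub = (((bP0*dot) // (bP2*128)) - dP0 + 128,
               ((bP1*dot) // (bP2*128)) - dP1 + 128,
               (dot // 128) - dP2 + 128)

        for i in xrange(3):
            out[start + i] = 0 if sub[i] < 0 else 255 if sub[i] > 255 else sub[i]
        out[start + 3] = 255
    return str(out)

def combine_parallel(base_bytes, detail_bytes, width, height, pixel_size, workers=2):
    # base_bytes/detail_bytes are the raw strings read from the pixel regions.
    row = width * pixel_size
    bands = [(base_bytes[i:i + row], detail_bytes[i:i + row], pixel_size)
             for i in xrange(0, height * row, row)]
    pool = Pool(workers)
    try:
        return "".join(pool.map(combine_band, bands))
    finally:
        pool.close()
        pool.join()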
