1.3 JIT - A hand-made function

Allocating memory
Filling it with instructions
Endianness
Operator enums to x86
Calling conventions
Ops breakdown: Print, Exit, Draw
Running the JIT
Drop

This will be a Just-In-Time compiled function in x86_64 CPU byte-ops, using the System V AMD64 ABI.

All going well it should work on Linux, Mac and Windows.

Using the JIT goes as follows:

allocate some read-write memory, represented by JitMemory
fill the memory with instrucion bytes while iterating over a Vec<Operator>
mark the allocated memory as executable, JitMemory::to_jit_fn() should return a JitFn
call it as a function with JitFn::run(), which transmutes the pointer to the executable memory into a function pointer and calls that pointer

This memory has to be requested in multiples of the page size supported by the CPU, for x86_64 the smallest valid size is 4k.

The OS will not allow a memory region to be both writable and executable at the same time, so these stages have to be clearly separated.

JitMemory keeps track of the address, the size we requested and the index offset where we are going to write the next byte.

JitFn maps to an address already marked as executable.

// jit/mod.rs

// Allocate memory sizes as multiples of 4k page.
const PAGE_SIZE: usize = 4096;

/// An executable memory buffer filled with `x86` instructions.
pub struct JitFn {
    addr: *mut u8,
    size: usize,
}

/// A read-write memory buffer allocated to be filled with bytes of `x86`
/// instructions. This is a private struct, use `JitFn::new()`. This way the
/// allocated memory address is only freed when the JitFn goes out of scope.
struct JitMemory {
    addr: *mut u8,
    size: usize,
    /// current position for writing the next byte
    offset: usize,
}

Allocating memory

The allocated memory block has to be aligned on a boundary of 16 bytes, otherwise we won’t be able to use it as an executable function on both Windows, Mac and Linux.

On Linux and Mac, libc::malloc() simply allocates a block at the next convenient address, and doesn’t guarantee any kind of alignment.

The libc manual points out 3.2.3.6 Allocating Aligned Memory Blocks, which describes posix_memalign(). The function signature is:

int posix_memalign (void **memptr, size_t alignment, size_t size)

We can use this to request an aligned memory block of a certain size, then use mprotect() to mark it as read-write.

Because this memory can contain anything when we get it, it’s a good idea to use memset() to fill it with ret calls (0xC3), so if our function moves to an unexpected instruction, at least it will just return.

posix_memalign() sets a C **void type pointer in the first argument. We have to transmute this to a *mut u8 Rust type, then we are good to go.

// jit/mod.rs

impl JitMemory {
    /// Allocates read-write memory aligned on a 16 byte boundary.
    #[cfg(any(target_os = "linux", target_os = "macos"))]
    pub fn new(num_pages: usize) -> JitMemory {
        let size: usize = num_pages * PAGE_SIZE;
        let addr: *mut u8;

        unsafe {
            // Take a pointer
            let mut raw_addr: *mut libc::c_void = mem::uninitialized();

            // Allocate aligned to page size
            libc::posix_memalign(&mut raw_addr,
                                 PAGE_SIZE,
                                 size);

            // Mark the memory as read-write
            libc::mprotect(raw_addr,
                           size,
                           libc::PROT_READ | libc::PROT_WRITE);

            // Fill with 'RET' calls (0xc3)
            libc::memset(raw_addr, 0xc3, size);

            // Transmute the c_void pointer to a Rust u8 pointer
            addr = mem::transmute(raw_addr);
        }

        JitMemory {
            addr: addr,
            size: size,
            offset: 0,
        }
    }
}

On Windows, we can use VirtualAlloc, the function signature is:

LPVOID WINAPI VirtualAlloc(
  _In_opt_ LPVOID lpAddress,
  _In_     SIZE_T dwSize,
  _In_     DWORD  flAllocationType,
  _In_     DWORD  flProtect
);

And so the equivalent JitMemory::new() for Windows:

// jit/mod.rs

impl JitMemory {
    #[cfg(target_os = "windows")]
    pub fn new(num_pages: usize) -> JitMemory {
        let size: usize = num_pages * PAGE_SIZE;
        let addr: *mut u8;

        unsafe {
            // Take a pointer
            let raw_addr: *mut winapi::c_void;

            // Allocate aligned to page size
            raw_addr = kernel32::VirtualAlloc(
                ptr::null_mut(),
                size as u64,
                winapi::MEM_RESERVE | winapi::MEM_COMMIT,
                winapi::winnt::PAGE_READWRITE);

            if raw_addr == 0 as *mut winapi::c_void {
                panic!("Couldn't allocate memory.");
            }

            // NOTE no FillMemory() or SecureZeroMemory() in the kernel32 crate

            // Transmute the c_void pointer to a Rust u8 pointer
            addr = mem::transmute(raw_addr);

        }

        JitMemory {
            addr: addr,
            size: size,
            offset: 0,
        }
    }
}

Filling it with instructions

The JitAssembler trait describes what JitMemory will have to implement so that we can assemble our function.

push_u8() puts in single-byte values. We can use this to construct 4-byte u32 and 8-byte u64 values in the memory, fill_jit() makes use of these. Eventually to_jit_fn() marks the memory block as executable and returns a JitFn where we can use JitFn::run() to call the function.

// jit/mod.rs

pub trait JitAssembler {

    /// Marks the memory block as executable and returns a `JitFn` containing
    /// that address.
    fn to_jit_fn(&mut self) -> JitFn;

    /// Fills the memory block with `x86` instructions while iterating over a
    /// list of `Operator` enums.
    fn fill_jit(&mut self, context: &mut Context, operators: &Vec<Op>);

    /// Writes one byte to the memory at the current index offset and increments
    /// the offset.
    fn push_u8(&mut self, value: u8);

    /// Writes a 4-byte value. `x86` specifies Little-Endian encoding,
    /// least-significant byte first.
    fn push_u32(&mut self, value: u32);

    /// Writes an 8-byte value.
    fn push_u64(&mut self, value: u64);
}

Implementing push_u8() assigns the value to the byte at addr + offset and increments the offset.

// jit/mod.rs

fn push_u8(&mut self, value: u8) {
    unsafe { *self.addr.offset(self.offset as _) = value };
    self.offset += 1;
}

Endianness

Implementing push_u32() and push_u64() has to construct these multi-byte values from single bytes, and the CPU architecture will specify if we must use either Little-Endian or Big-Endian encoding, or can do both.

At least this doesn’t vary between OSes, the CPU specifies it, and the Endianness table tells us that all the x86 CPUs stick to Little-Endian (LE).

The naming is counter-intuitive, the “end” refers to the front- or starting end (first byte by pointer address), not to the “final portion” end (last byte by pointer address).

There are advantages in this for the machine. The pointer address points to the first byte, and casting between interger types doesn’t need to relocate the bytes, just ignore or append to the end.

For small intergers it will even preserve the value:

type	bytes (LE)	value
`u32`	`7A 00 00 00`	122
`u16`	`7A 00`	122
`u32`	`0D 0C 0B 0A`	168496141
`u16`	`0D 0C`	3085

Endianness

(Diagram from the Endianness Wikipedia page.)

Implementing it means bit-shifting the value, zeroing the higher bits by taking a bitwise & with 0xFF and what’s left is an u8 byte:

// jit/mod.rs

fn push_u32(&mut self, value: u32) {
    self.push_u8(((value >>  0) & 0xFF) as u8);
    self.push_u8(((value >>  8) & 0xFF) as u8);
    self.push_u8(((value >> 16) & 0xFF) as u8);
    self.push_u8(((value >> 24) & 0xFF) as u8);
}

Operator enums to x86

We expressed the operations we want to happen with a list of enums, it’s time to fill the JIT (that is, write the instructions to memory) with a function that iterates over those enums, while also passing an &mut Context pointer so we can manipulate the application’s state.

fn fill_jit(&mut self, context: &mut Context, operators: &Vec<Operator>);

By convention we have to start our program with the function prologue and end with the epilogue. If you were writing assembly on Linux or Mac:

; prologue
push rbp
mov rbp, rsp

; ... action and excitement

; epilogue
mov rsp, rbp
pop rbp
ret

And so we have to put the equivalent byte opcodes in the memory. To preserve our sanity we will compose small functions which put in a few bytes at a time.

This way, when we have our JitMemory, we can call these one after the other as if writing assembly.

For example the push_rbp() and mov_rbp_rsp() would be:

// jit/mod.rs

impl JitMemory {
    pub fn push_rbp(&mut self) {
        self.push_u8(0x55);
    }

    pub fn mov_rbp_rsp(&mut self) {
        self.push_u8(0x48);
        self.push_u8(0x89);
        self.push_u8(0xe5);
    }
}

And then we can use this in fill_jit() to start and end our hand-made function, and in between we can implement action and excitement!

// jit/mod.rs

impl JitAssembler for JitMemory {

    fn fill_jit(&mut self, context: &mut Context, operators: &Vec<Op>) {
        // prologue
        self.push_rbp();
        self.mov_rbp_rsp();

        for op in operators.iter() {
            match *op {
                Op::NOOP => (),
                Op::Draw(sprite_idx, offset, speed) => {
                    /* push x86 byte instructions to memory as if writing
                    assembly, but we are assembling the machine code directly */
                },
                Op::Print => { /* ... */ },
                Op::Clear(charcode) => { /* ... */ },
                Op::Exit(limit) => { /* ... */ },
            }
        }

        // epilogue
        self.mov_rsp_rbp();
        self.pop_rbp();
        self.ret();
    }
}

Calling conventions

Linux and Mac

The Linux x86_64 ABI is sysv64, the first few arguments are passed in registers, any remaining ones are passed on the stack.

The first six integer arguments are passed in:

rdi, rsi, rdx, rcx, r8, r9

Floating-point arguments are passed in xmm0 - xmm7.

The function prologue and epilogue will be:

; prologue
push rbp
mov rbp, rsp

; function body

; - put function arguments in registers
; - put the Rust function pointer in rax

call rax

; epilogue
mov rsp, rbp
pop rbp
ret

The rsp register must point to an address on a 16-byte boundary before the call jumps. Remember that call will push rdi, moving rsp with -8 bytes immediately before the jump.

In the prologue we made one push (rsp moved -8 bytes), and call will add another push (-8 again), which is -16, so we don’t have to sub the value in rsp to adjust.

At that point we can call the address of the Rust function which corresponds to the operator enum.

If we were counting wrong, we would find out here, because our JIT function would segfault.

After the call, it is good to note that if we didn’t have enough registers for the function arguments and we pushed the remaining ones to the stack, then here we would have to clean up (the stack) by adding to the address value of rsp.

But there is no cleaning to do at this time and we can just finish after the call.

Windows

We don’t have to write the JIT assembler two times. Rust will compile the extern sysv64 functions on Windows as well.

Ops breakdown: Draw

Op::Draw(u8, u8, f32)

Defining the function in the trait:

// jit/ops.rs

pub trait Ops {
    extern "sysv64" fn op_draw(&mut self, sprite_idx: u8, offset: u8, speed: f32);
}

impl Ops for Context {
    extern "sysv64" fn op_draw(&mut self, sprite_idx: u8, offset: u8, speed: f32) {
        self.impl_draw(sprite_idx, offset, speed);
    }
}

While the actual implementation is on the Context struct:

impl Context {
    /// Write a text sprite into the buffer, starting at `offset` and moving
    /// with `speed`.
    pub fn impl_draw(&mut self, sprite_idx: u8, offset: u8, speed: f32) {
        if (sprite_idx as usize) < self.sprites.len() {

            let total_offset: usize = ((offset as f32 + self.time * speed) % (self.buffer.len() as f32)) as usize;

            let ref sprite = self.sprites[sprite_idx as usize];

            let mut v: Vec<char> = vec![];
            for ch in sprite.chars() { v.push(ch); }

            for i in 0 .. v.len() {
                let n = (total_offset + i) % self.buffer.len();
                self.buffer[n] = v[i];
            }
        }
    }
}

We will take the memory address of this and This is what we call in the JIT.

That is, we call from the JIT to a Rust function, where &mut self is a &mut Context, so the first argument is that pointer address (integer value).

In addition we are passing integer and float arguments as well. And so in fill_jit():

// jit/mod.rs

// Linux and Mac
Op::Draw(sprite_idx, offset, speed) => {
    // rdi: pointer to Context (pointer is an integer value)
    self.movabs_rdi_u64( unsafe { mem::transmute(context as *mut _) });
    // rsi: sprite_idx arg. (interger)
    self.movabs_rsi_u64(sprite_idx as u64);
    // rdx: offset arg. (interger)
    self.movabs_rdx_u64(offset as u64);
    // xmm0: speed arg. (floating point)
    self.movss_xmm_n_f32(0, speed);

    self.movabs_rax_u64( unsafe { mem::transmute(
        Ops::draw as extern "sysv64" fn(&mut Context, u8, u8, f32)
    )});
    self.call_rax();
},

Running the JIT

The function we crated byte-by-byte is ready to be executed.

In the main drawing loop, we call the JIT function at every frame, passing it a &mut Context pointer.

// jit/mod.rs

impl JitFn {
    pub fn run(&self, context: &mut Context) {
        if self.size == 0 {
            return;
        }
        unsafe {
            // type signature of the jit function
            let fn_ptr: extern fn(&mut Context);
            // transmute the pointer of the executable memory to a pointer of the jit function
            fn_ptr = mem::transmute(self.addr);
            // use the function pointer
            fn_ptr(context)
        }
    }
}

Drop

Since we allocated this piece of memory manually, we are also responsible for freeing it. libc::munmap and kernel32::VirtualFree() will do the job. The signatures:

// Linux and Mac
int munmap (void *addr, size_t length)

// Windows
BOOL WINAPI VirtualFree(
  _In_ LPVOID lpAddress,
  _In_ SIZE_T dwSize,
  _In_ DWORD  dwFreeType
);

The user should use JitFn::new(), giving them an already executable memory address.

JitFn::new(num_pages: usize, context: &mut Context, operators: &Vec<Op>) -> JitFn

JitMemory (a writable memory) is a private struct used only by the lib, so the user can’t end up with both a JitFn and JitMemory storing the same memory address.

The memory address we allocated is then stored by JitFn. The variable assigned to JitFn will be freed when it goes out of scope, but nobody knows about this other piece of memory we allocated, so we have to implement freeing that as well.

We can implement the Drop trait to do additional clean-up at the time when JitFn goes out of scope.

// jit/mod.rs

impl Drop for JitFn {
    #[cfg(any(target_os = "linux", target_os = "macos"))]
    fn drop(&mut self) {
        unsafe { libc::munmap(self.addr as *mut _, self.size); }
    }

    #[cfg(target_os = "windows")]
    fn drop(&mut self) {
        unsafe { kernel32::VirtualFree(self.addr as *mut _, 0, winapi::MEM_RELEASE); }
    }
}