The Monument Engine: Jx

The team at Monument has produced composable multi-threaded concurrency, 100% compatible with the J language. We've called it Jx, or the "Monument engine."

It's fast and well-suited for rapid development. You can use all the cores on your computer, simply and elegantly.

With only slight modification of existing code, Jx enables users to apply parallel rank, interleave concurrent functions, and call several efficient parallel primitives.

Email us at info@monument.ai with questions and comments.

Downloads

Binaries

By downloading or using these files, you agree to evaluation licensing.

These files are drop-in replacements for the J engine. Select the AVX or AVX2 binaries based on your machine type, which you can find by running 2!:7'' in J. For example, if your machine is a Windows machine with AVX2 instructions, you could run ./jconsole -lib javx2.dll -jprofile.

Licensing

  • These binaries are built on commercial license from Jsoftware, and we’re making them available for evaluation, pending further decisions on licensing.
  • The evaluation license is available for review here.

You may need to install an OpenMP runtime if it is not already installed on your macOS or Linux machine. An OpenMP runtime is included with the Windows download.

Multi-threading

Futures provide a general method for parallelism at the thread level: they allow multiple tasks to share an address space and a J namespace, and to work simultaneously in that space. The method can be used simply as a parallel rank operator (in which each task runs the same code), but it can also be used to launch completely separate tasks running different portions of a problem.

Any function can generate a future, and the future will return immediately. Consuming the result of a future will block until completion. The ideal application is one that can be broken into pieces that can run for several hundred milliseconds without heavy use of L3 cache. The pieces need not be identical.
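As a minimal sketch of this behavior (names here are illustrative; the delay foreign 6!:3 stands in for a real workload):

wait =. (6!:3 .. 0) 		NB. Any verb can generate a future with (..)
r =. wait 2 			NB. Returns immediately; r is a future
> r 				NB. Consuming (unboxing) blocks until the 2-second task completes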

In addition, Jx has undergone extensive testing to stress and verify the implementation for:

  • symbol management,
  • memory allocation, and
  • task switching

Shared Memory

Memory is shared between processes with zero unnecessary data movement or serialization overhead. Users can call multiple functions on shared arguments, and access global variables on any thread. Globals and locales are OK for a future to read, but not to write. Locales created by the future can be written by the future.

Shared Namespaces

All tasks share the master thread's global namespace. (As always in J, a task can create a private namespace in a locale if it wants to.) You do not have to take pains to make all your names visible to an external process.
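As a hedged sketch of both points (names illustrative), a future can read a global defined by the master thread with no marshalling:

total =: 100 			NB. Global assigned by the master thread
addtotal =. {{ total + y }} 	NB. A future may read the global; it should not assign to it
> (addtotal .. 0) 1 		NB. The task sees total directly; should yield 101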

Primitive-level Parallelism

Matrix multiply and matrix inversion run efficiently on multiple threads, as they are computationally intense and require little data sharing. Other parallelized primitives have so far proven useful only for very large arguments. Primitive-level threading is currently disabled except for matrix multiply and matrix inversion. Parallel primitives are currently not permitted to be nested with multi-threading: primitives will override user-defined functions if M and N in 128!:30 (M N) are both greater than 1.
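A sketch under the 128!:30 convention (matrix size illustrative): primitive-level threading can be used on its own by setting M greater than 1 and N to 1.

128!:30 (4 1) 			NB. 4 threads for parallel primitives; futures disabled
a =. 500 500 ?@$ 0 		NB. Random float matrix
6!:2 'a (+/ . *) a' 		NB. Time the matrix product across the primitive-level threads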

Usage

The futures model is elegant and flexible. In the simplest form, called parallel rank, the same task is run on different cells of an argument. In J, this means replacing

Result =. DoMyFunction"2 ArgumentArray

with

Result =. (DoMyFunction .. 0)"2 ArgumentArray

That simple change will cause each cell of ArgumentArray to be processed in its own task. The Result will be an array of futures, which will be filled in one by one as the tasks complete. The master thread can continue with other work until it needs to know the result values.
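A small sketch of that workflow (names illustrative):

sums =. (+/ .. 0)"1 i. 4 1000 	NB. Each row is summed in its own task; sums is an array of futures
NB. ...the master thread can do other work here...
> sums 				NB. Unboxing blocks only until any unfinished tasks complete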

At a higher level, you can break your application into whatever pieces you like:

   Res1 =. (part1 .. 0) arguments
   Res2 =. (part2 .. 0) arguments
   Res3 =. (part3 .. 0) arguments
   etc.

The tasks will run in parallel. One thread per core gets you as much computation as the machine can provide. In general, J is efficient enough in computation that hyperthreading will not enhance performance.

Break will interrupt one thread and is reset immediately, before the other threads have stopped.

Easy Experimentation

If you want to see whether your application will benefit from parallelism, you can quickly find out by hiving part of it off to a separate task. All you have to do is use (..) to run a verb on a different core. You don't have to split the program into separate blocks and run them in different processes.

128!:30 (M N)    		NB. Set thread count.
				NB. M is the number of threads available for parallel primitives, also settable with 128!:30 (M)
				NB. N is the number of threads available for futures. Maximum 63.
				NB. M or N in {0,1} turns off parallelization at that respective level
				NB. Nested parallelism is currently disabled. It can be enabled upon request.
				NB. 128!:30 returns the prior settings for 128!:30

parallelrank =:  {{u .. 0"n}}
> +/ parallelrank (1) i. 10 10	NB. Apply a drop-in replacement for " using futures

_1 (9!:58)"0 i.3 		NB. Disable BLAS for parallel matrix math primitives

Examples

The form (u .. 0) can be applied to any verb u. u will then run across an array of nouns, applying all threads made available with 128!:30.

NB. Examples of multi-threading user-defined functions.
NB. This parameter must be set in order for multi-threading to work.

threadcount =. 10      		NB. Set to no more than the count of physical CPU cores,
                                NB. and at most 63 (an arbitrary limit we have imposed).
				NB. Hyperthreading probably will not help as J utilizes cores effectively.
128!:30 (1, threadcount)

NB. =========================================================
NB. Test threads on futures
NB. =========================================================

{{+/ .. 0 y}} i.10		NB. Create a future, which will not block until unboxed. 


{{> +/ .. 0"1 y}} i. (10 10) 	NB. Basic example of parallel rank.
{{> +/ .. 0"2 y}} i. (10 10)	NB. Change rank.

Here's an example which calculates and visualizes the Mandelbrot set, extended from Rosetta Code.

load 'viewmat'

NB. Computation and display
mcf=. (<: 2:)@|@(] ((*:@] + [)^:((<: 2:)@|@])^:1000) 0:) 	NB. 1000 iterations test
domain=. |.@|:@({.@[ + ] *~ j./&i.&>/@+.@(1j1 + ] %~ -~/@[))&>/
viewmat > (mcf .. 0"0) domain (_2j_1 1j1) ; 0.005

NB. Timing comparison on single vs multiple threads
6!:2 'mcf "0 domain (_2j_1 1j1) ; 0.005' 		NB. Serial timing (single thread)
6!:2 '> (mcf .. 0"0) domain (_2j_1 1j1) ;0.005'		NB. Parallel timing (multi-threading)

Futures can also be created individually to produce more complex concurrency patterns inside functions.

wait =. (6!:3 .. 0) 		NB. Demonstrate futures.
time =. 6!:2
run =. {{
 c1 =. wait 1
 c2 =. wait 2
 c3 =. wait > c1
 +/ ; c1,c2,c3
}}

NB. If running this interactively in a Jconsole, `128!:30`
NB. must return before running `time` (i.e. run each line independently).

128!:30 (1 1)  			NB. Set single-threaded operation.
time 'run 0'			NB. Run should require four seconds on a single thread.
128!:30 (1 4)			NB. Set multi-threading for tasks to 4.
time 'run 0'			NB. Run should require two seconds with multi-threading.

NB. Futures will block main thread (i.e. if called interactively in a console), but 
NB. will not block other futures (i.e. if run inside a function, as above). 

Matrix primitives are also parallelized, separately and distinctly from futures-based multi-threading.

NB. =========================================================
NB. Test number of cores on matrix primitives
NB. =========================================================
load 'plot'

ptest =: 4 : 0
128!:30 (y) 			NB. Set thread counts to whatever is passed in y
10 (6!:2) x 			NB. Time the sentence passed in x, averaged over 10 runs
)

a =: 200 200 ?@$ 0
_1 (9!:58)"0 i.3 		NB. Turn off BLAS for all matrices
Jmatrix =: ('a (+/ . *) a' ptest ])"0 ] 1+i.6
0 (9!:58)"0 i.3 		NB. Turn on BLAS for all matrices
BLAS =: ('a (+/ . *) a' ptest ])"0 ] 1+i.6
plot (BLAS % Jmatrix) 		NB. Speedup over BLAS. X axis is core count

Implementation

Jx futures are built on the OpenMP task model. We create a number of worker threads, normally one per core in your system. The master thread executes your program, keeping the worker threads at the ready. Creating a parallel task is very simple: instead of writing

Result =. DoMyFunction arguments

you write

Result =. (DoMyFunction .. 0) arguments

The new primitive is (..), which says 'run this in a separate task'.

The Result of the task is called a future. It is a box containing the result of the task. The future can be passed around like any other value. The master thread continues execution after the task is started, and may start other tasks or do work on its own.

Eventually the master thread will need to see the result of the task it spawned. It does so by looking inside the future. If the task has finished, its result will be contained in the future-box, and the master thread will continue in stride. If the task is still running, the master thread will block until the task is ready. Because a task is created only when it has enough resources to run to completion, the master thread will always be able to continue when its spawned task completes.

Parallel Operators

In general, parallelizing individual operators has proven in testing not to produce performance improvements over a single thread. This is because the cost of moving data to a different core generally exceeds the benefit of performing calculations in parallel. A notable exception is matrix math, which is capable of high single-thread utilization on a small amount of data. We thus enable users to apply multiple threads to matrix operators.

BLAS

In the equivalent version of J from Jsoftware, BLAS is used for matrix multiplication above certain matrix size thresholds which vary by OS. BLAS is multi-threaded and will use all available machine cores. However, matrix multiply and inversion using BLAS are less performant than Jx's own parallel implementations. In addition, as parallelization nesting is currently disabled, BLAS will interfere with multi-threaded futures. We recommend disabling BLAS when using multi-threading.

_1 (9!:58)"0 i.3 		NB. Disable BLAS.
0 (9!:58)"0 i.3			NB. BLAS can be reactivated for all matrices with this command.
(300 300 300) (9!:58)"0 i.3 	NB. BLAS can be reactivated at matrix dimensions above these thresholds for floats, integers, and complex numbers, respectively.