Cython is a programming language based on Python, with extra syntax allowing for optional static type declarations. It aims to become a superset of the Python language that combines high-level, object-oriented, functional, and dynamic programming. The source code gets translated into optimized C/C++ code and compiled as Python extension modules. This allows for both very fast program execution and tight integration with external C libraries, while keeping up the high programmer productivity for which the Python language is well known.
The primary Python execution environment is commonly referred to as CPython, as it is written in C. Other major implementations use Java (Jython), C# (IronPython) and Python itself (PyPy). Written in C, CPython has been conducive to wrapping many external libraries that interface through the C language. It has, however, remained non-trivial to write the necessary glue code in C, especially for programmers who are more fluent in a high-level language like Python than in a do-it-yourself language like C.
Originally based on the well-known Pyrex, the Cython project has approached this problem by means of a source code compiler that translates Python code to equivalent C code. This code is executed within the CPython runtime environment, but at the speed of compiled C and with the ability to call directly into C libraries. At the same time, it keeps the original interface of the Python source code, which makes it directly usable from Python code. These two-fold characteristics enable Cython’s two major use cases: extending the CPython interpreter with fast binary modules, and interfacing Python code with external C libraries.
While Cython can compile (most) regular Python code, the generated C code usually gains major (and sometimes impressive) speed improvements from optional static type declarations for both Python and C types. These allow Cython to assign C semantics to parts of the code, and to translate them into very efficient C code. Type declarations can therefore be used for two purposes: for moving code sections from dynamic Python semantics into static-and-fast C semantics, but also for directly manipulating types defined in external libraries. Cython thus merges the two worlds into a very broadly applicable programming language.
HISTORY
Cython is a derivative of the Pyrex language, and supports more features and optimizations than Pyrex.
Cython was forked from Pyrex in 2007 by developers of the Sage computer algebra package, because they were unhappy with Pyrex's limitations and could not get patches accepted by Pyrex's maintainer Greg Ewing, who envisioned a much smaller scope for his tool than the Sage developers had in mind. They then forked Pyrex as SageX. When they found people were downloading Sage just to get SageX, and developers of other packages (including Stefan Behnel, who maintains lxml) were also maintaining forks of Pyrex, SageX was split off the Sage project and merged with cython-lxml to become Cython.
Python was created in the early 1990s by Guido van Rossum at Stichting Mathematisch Centrum in the Netherlands as a successor of a language called ABC. Guido remains Python’s principal
author, although it includes many contributions from others. In 1995, Guido continued his work on Python at the Corporation for National Research Initiatives in Reston, Virginia where he released several versions of the software.
In May 2000, Guido and the Python core development team moved to BeOpen.com to form the BeOpen PythonLabs team. In October of the same year, the PythonLabs team moved to Digital Creations. In 2001, the Python Software Foundation was formed, a non-profit organization created specifically to own Python-related intellectual property. Zope Corporation is a sponsoring
member of the PSF. All Python releases are Open Source.
FEATURES
Cython is a programming language based on Python that is translated into C/C++ code, and finally compiled into binary extension modules that can be loaded into a regular Python session. Cython extends the Python language with explicit type declarations of native C types. One can annotate attributes and function calls to be resolved at compile time (as opposed to runtime). With the extra information from the annotations, Cython is able to generate code that sidesteps most of the usual runtime costs.
The generated code can take advantage of all the optimizations the C/C++ compiler is aware of without having to re-implement them as part of Cython. Cython integrates the C language and the Python runtime through automatic conversions between Python types and C types, allowing the programmer to switch between the two without having to do anything by hand. The same applies when calling into external libraries written in C, C++ or Fortran. Accessing them is a native operation in Cython code, so calling back and forth between Python code, Cython code and native library code is trivial.
Of course, if we're manually annotating every variable, attribute, and return type with type information, we might as well be writing C/C++ directly. Here is where Cython's approach of extending the Python language really shines. Anything that Cython can't determine statically is compiled with the usual Python semantics, meaning that you can selectively speed up only those parts of your program that expose significant execution times. The key thing to keep in mind in this context is the Pareto principle, also known as the 80/20 rule: 80% of the runtime is spent in 20% of the source code. This means that a little bit of annotation in the right spot can go a long way.
This leads to an extremely productive workflow in Cython: users can simply develop with Python, and if they find that a significant amount of time is being spent paying Python overheads, they can compile parts or all of their project with Cython, possibly providing some annotations to speed up the critical parts of the code. For code that spends almost all of its execution time in libraries doing things like FFTs, matrix multiplication, or linear system solving, Cython fills the same rapid development role as Python. However, as you extend the code with new functionality and algorithms, you can do this directly in Cython, and just by providing a little extra type information you can get all the speed of C without all the headaches.
1. Fast array access
Fast array access, added to the Cython language by D. S. Seljebotn and R. W. Bradshaw in 2008, was an important improvement in convenience for numerical users. The work is based on PEP 3118, which defines a C API for direct access to the array data of Python objects acting as array data containers. Cython is able to treat most of the NumPy array data types as corresponding native C types. Since Cython 0.11.2, complex floating point types are supported, either through the C99 complex types or through Cython's own implementation. Record arrays are mapped to arrays of C structs for efficient access. Some data types are not supported, such as string/Unicode arrays, arrays with non-native endianness, and Boolean arrays. The latter can, however, be treated as 8-bit integer arrays in Cython; the Cython documentation contains further details.
To discuss this feature we will start with the example of naive matrix multiplication. Beginning with a pure Python implementation, we will incrementally add optimizations. The benchmarks should help Cython users decide how far one wants to go in other cases.
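A minimal sketch of the naive routine under discussion (the argument checking mentioned next is omitted, and out must be preallocated by the caller):

def matmul(A, B, out):
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            s = 0
            for k in range(A.shape[1]):
                s += A[i, k] * B[k, j]
            out[i, j] = s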
For clarity of exposition, this skips the details of sanity checking the arguments. In a real setting one should probably also automatically allocate out if not provided by the caller. Simply compiling this in Cython results in about a 1.15x speedup over Python. This minor speedup is due to the compiled C code being faster than Python's bytecode interpreter. The generated C code still uses the Python C API, so that e.g. the array lookup A[i, k] translates into C code very similar to:
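An illustrative sketch of the kind of C API code involved (the real generated code adds error handling and reference counting):

tmp = PyTuple_New(2);                        /* build the index tuple (i, k) */
PyTuple_SET_ITEM(tmp, 0, PyInt_FromLong(i));
PyTuple_SET_ITEM(tmp, 1, PyInt_FromLong(k));
A_ik = PyObject_GetItem(A, tmp);             /* generic Python indexing */

The much larger speedups reported next come from adding static type declarations. A minimal sketch, using the declaration style that appears later in this article (dtype_t is a typedef for np.float64_t):

import numpy as np
cimport numpy as np

ctypedef np.float64_t dtype_t

def matmul(np.ndarray[dtype_t, ndim=2] A,
           np.ndarray[dtype_t, ndim=2] B,
           np.ndarray[dtype_t, ndim=2] out):
    cdef Py_ssize_t i, j, k
    cdef dtype_t s
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            s = 0
            for k in range(A.shape[1]):
                s += A[i, k] * B[k, j]
            out[i, j] = s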
In our benchmarks this results in a speedup of between 180-190 times over pure Python. The exact factor varies depending on the size of the matrices. If they are not small enough to be kept in the CPU cache, the data must be transported repeatedly over the memory bus. This is close to equally expensive for Python and Cython and thus tends to slightly diminish any other effects. Table 1 has the complete benchmarks, with one in-cache benchmark and one out-of-cache benchmark for every case.
Note however that the speedup does not come without some costs. First, the routine is now only usable for 64-bit floating point. Arrays containing any other data type will result in a ValueError being raised. Second, it is necessary to ensure that typed variables containing Python objects are not None. Failing to do so can result in a crash or data corruption if None is passed to the routine.
The generated C source for the array lookup A[i, k] now looks like this:
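(The following is an illustrative reconstruction rather than verbatim compiler output; variable names are simplified.)

/* wraparound: support negative indices */
if (i < 0) i += A_shape_0;
if (k < 0) k += A_shape_1;
/* boundscheck: raise IndexError on an invalid access */
if (i < 0 || i >= A_shape_0 || k < 0 || k >= A_shape_1) {
    PyErr_SetString(PyExc_IndexError, "index out of bounds");
    goto error;
}
A_ik = *(dtype_t*)(A_data + i * A_stride_0 + k * A_stride_1);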
This is a lot faster because there are no API calls in a normal situation, and access of the data happens directly to the underlying memory location. The initial conditional tests are there for two reasons. First, an if-test is needed to support negative indices. With the usual Python semantics, A[-1, -1] should refer to the lower-right corner of the matrix. Second, it is necessary to raise an exception if an index is out of bounds. Such if-tests can bring a large speed penalty, especially in the middle of the computational loop. It is therefore possible to instruct Cython to turn off these features through compiler directives. The following code disables support for negative indices (wraparound) and bounds checking:
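A sketch using Cython's standard directive decorators, applied to the typed routine from before:

cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def matmul(np.ndarray[dtype_t, ndim=2] A,
           np.ndarray[dtype_t, ndim=2] B,
           np.ndarray[dtype_t, ndim=2] out):
    <...>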
This removes all the if-tests from the generated code. The resulting benchmarks indicate around 800 times speedup at this point in the in-cache situation, and 700 times out-of-cache. Only disabling one of either wraparound or boundscheck will not have a significant impact, because there will still be if-tests left.
One trade-off is that should one access the arrays out of bounds, one will have data corruption or a program crash. The normal procedure is to leave bounds checking on until one is completely sure that the code is correct, then turn it off. In this case we are not using negative indices, and it is an easy decision to turn them off. Even if we were, it would be faster to manually add code to calculate the corresponding positive index. wraparound is mainly enabled by default to reduce the number of surprises for casual Cython users, and one should rarely leave it on.
In addition to the per-function level used here, compiler directives can also be specified globally for the source file or for individual code blocks. See [directives] for further information.
2. Caring about memory layout
Both C/C++ and Fortran assume that arrays are stored as one contiguous chunk in memory. NumPy arrays depart from this tradition and allow for arbitrarily strided arrays. Consider the example of B = A[::-2, :]. That is, let B be the array consisting of every other row in A, in reverse order. In many environments, the need for representing B contiguously in memory mandates that a copy is made in such situations. NumPy supports a wider range of array memory layouts and can in this situation construct B as a new view to the same data that A refers to. The benefit of the NumPy approach is that it is more flexible, and allows avoiding copying of data. This is especially important if one has huge data sets where the main memory might only be able to hold one copy at a time. With this choice does, however, come responsibility. In order to gain top performance with numerical computations it is in general crucial to pay attention to memory layout.
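To make the distinction concrete, here is a small illustrative NumPy session (the stride values assume 8-byte floats):

import numpy as np

A = np.arange(12.0).reshape(4, 3)  # C-contiguous; strides (24, 8)
B = A[::-2, :]                     # every other row, reversed: a view
print(B.strides)                   # (-48, 8): no data was copied
print(B.base is A)                 # True: B shares A's memory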
In the example of matrix multiplication, the first matrix is accessed row-by-row and the second column-by-column. The first matrix should thus be stored with contiguous rows, "C contiguous", while the second should be stored with contiguous columns, "Fortran contiguous". This will keep the distance between subsequent memory accesses as small as possible. In an out-of-cache situation, our fastest matmul routine so far does around 700 times better than pure Python when presented with matrices with optimal layout, but only around 180 times better with the worst-case layout. See table 1.
The second issue concerning memory layout is that a price is paid for the generality of the NumPy approach: in order to address arbitrarily strided arrays, an extra integer multiplication operation must be done per access. In the matmul implementation above, A[i, k] was translated to the following C code:
A_ik = *(dtype_t*)(A_data + i * A_stride_0
                           + k * A_stride_1);
By telling Cython at compile-time that the arrays are contiguous, it is possible to drop the innermost stride multiplication. This is done by using the “mode” argument to the array type:
def matmul(
        np.ndarray[dtype_t, ndim=2, mode="c"] A,
        np.ndarray[dtype_t, ndim=2, mode="fortran"] B,
        np.ndarray[dtype_t, ndim=2] out=None):
    <...>
The in-cache benchmark now indicates a 780x speedup over pure Python. The out-of-cache improvement is smaller but still noticeable. Note that no restrictions are put on the out argument. Doing so did not lead to any significant speedup, as out is not accessed in the inner loop.
The catch is, of course, that the routine will now reject arrays which do not satisfy the requirements. This happens by raising a ValueError. One can make sure that arrays are allocated using the right layout by passing the "order" argument to most NumPy array constructor functions, as well as to the copy() method of NumPy arrays. Furthermore, if an array A is C contiguous, then the transpose, A.T, will be a Fortran contiguous view of the same data.
The number of memory accesses when multiplying two n×n matrices scales as O(n³), while copying the matrices scales as O(n²). One would therefore expect, given that enough memory is available, that making a temporary copy pays off once n passes a certain threshold. In this case benchmarks indicate that the threshold is indeed very low (in the range of n = 10) and one would typically copy in all situations. The NumPy functions ascontiguousarray and asfortranarray are helpful in such situations.
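For instance, a minimal sketch of preparing optimal layouts before calling the mode-constrained matmul shown earlier:

import numpy as np

A = np.ascontiguousarray(A)  # ensure C-contiguous rows ("C" order)
B = np.asfortranarray(B)     # ensure Fortran-contiguous columns
out = np.empty((A.shape[0], B.shape[1]))  # C order by default
matmul(A, B, out)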
3. Calling an external library
The real-world use case for Cython is to speed up custom numerical loops for which there are no prior implementations available. For a simple example like matrix multiplication, going with existing implementations is always better. For instance, NumPy's dot function is about 6 times faster than our fastest Cython implementation, since it uses smarter algorithms. Under the hood, dot makes a call to the dgemm function in the Basic Linear Algebra Subprograms (BLAS) API. One advantage of Cython is how easy it is to call native code. Indeed, for many, this is the entire point of using Cython. We will demonstrate these features by calling BLAS (BLAS is an API with many implementations; the benchmarks in this paper are based on the open-source ATLAS implementation, custom-compiled on the host by Sage) for the inner products only, rather than for the whole matrix multiplication. This allows us to stick with the naive matrix multiplication algorithm, and also demonstrates how to mix Cython code and the use of external libraries. The BLAS API must first be declared to Cython:
cdef extern from "cblas.h":
    double ddot "cblas_ddot"(int N,
                             double *X, int incX,
                             double *Y, int incY)
The need to re-declare functions which are already declared in C is unfortunate and an area of possible improvement for Cython. Only the declarations that are actually used need to be repeated. Note also the use of C pointers to represent arrays. BLAS also accepts strided arrays and expects the strides passed in the incX and incY arguments. Other C APIs will often require a contiguous array where the stride is fixed to one array element; the previous section discussed how one can ensure that arrays are contiguous. The matrix multiplication can now be performed like this:
ctypedef np.float64_t dtype_t

def matmul(np.ndarray[dtype_t, ndim=2] A,
           np.ndarray[dtype_t, ndim=2] B,
           np.ndarray[dtype_t, ndim=2] out):
    cdef Py_ssize_t i, j
    cdef np.ndarray[dtype_t, ndim=1] A_row, B_col
    for i in range(A.shape[0]):
        A_row = A[i, :]
        for j in range(B.shape[1]):
            B_col = B[:, j]
            out[i, j] = ddot(
                A_row.shape[0],
                <dtype_t*>A_row.data,
                A_row.strides[0] // sizeof(dtype_t),
                <dtype_t*>B_col.data,
                B_col.strides[0] // sizeof(dtype_t))
This demonstrates how NumPy array data can be passed to C code. Note that NumPy strides are in numbers of bytes, while BLAS expects them in numbers of elements. Also, because the array variables are typed, the "data" attributes are C pointers rather than Python buffer objects. Unfortunately, this results in a slowdown for moderately sized matrices. This is due to the slice operations: an operation like A_row = A[i, :] is a Python operation and results in Python call overhead and the construction of several new objects. Cython is unlikely to ever optimize slicing of np.ndarray variables, because it should remain possible to use subclasses of ndarray with a different slicing behaviour. The new memory view type, discussed below, represents a future solution to this problem. Another solution is to do the slice calculations manually and use C pointer arithmetic:
out[i, j] = ddot(
    A.shape[1],
    <dtype_t*>(A.data + i * A.strides[0]),
    A.strides[1] // sizeof(dtype_t),
    <dtype_t*>(B.data + j * B.strides[1]),
    B.strides[0] // sizeof(dtype_t))
This version leads to over 1300 times speedup over pure Python in the optimal, in-cache situation. This is due to BLAS using the SSE2 instruction set, which enables doing two double-precision floating point multiplications in one CPU instruction. When non-contiguous arrays are used, the performance drops to 970 times that of pure Python (in-cache) as SSE2 can no longer be used. For more advanced array data passing, Cython makes it easy to make use of NumPy's C API. Consider for instance a custom C library which returns a pointer to some statically allocated data, which one would like to view as a NumPy array. Using Cython and the NumPy C API this is easily achieved:
cimport numpy as np

np.import_array()  # initialize NumPy's C API before using it

cdef extern from "mylib.h":
    cdef int get_my_data(double** out_data,
                         int* out_size)

def my_data_as_ndarray():
    cdef np.npy_intp shape[1]
    cdef int arr_length
    cdef double* arr_ptr
    if get_my_data(&arr_ptr, &arr_length) != 0:
        raise RuntimeError("get_my_data failed")
    shape[0] = arr_length
    return np.PyArray_SimpleNewFromData(1, shape,
                                        np.NPY_DOUBLE, <void*>arr_ptr)
4. Calling external C functions
It is perfectly OK to do from math import sin to use Python’s sin() function. However, calling C’s own sin() function is substantially faster, especially in tight loops. It can be declared and used in Cython as follows:
cdef extern from "math.h":
    double sin(double)

cdef double f(double x):
    return sin(x * x)
At this point there are no longer any Python wrapper objects around our values inside of the main for loop, and so we get an impressive speedup to 219 times the speed of Python. Note that the above code re-declares the function from math.h to make it available to Cython code. The C compiler will see the original declaration in math.h at compile time, but Cython does not parse math.h and requires a separate definition. When calling C functions, one must take care to link in the appropriate libraries. This can be platform-specific; the example below works on Linux and Mac OS X:
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

ext_modules = [
    Extension("demo",
              ["demo.pyx"],
              libraries=["m"])  # Unix-like specific
]

setup(
    name="Demos",
    cmdclass={"build_ext": build_ext},
    ext_modules=ext_modules
)
If one uses the Sage notebook to compile Cython code, one can use a special comment to tell Sage to link in libraries:
#clib: m
Just like the sin() function from the math library, it is possible to declare and call into any C library as long as the module that Cython generates is properly linked against the shared or static library. A more extensive example of wrapping a C library is given in the section Using C libraries.
5. Using C libraries
Apart from writing fast code, one of the main use cases of Cython is to call external C libraries from Python code. As the C string handling in a later section will show, it is actually trivial to call C functions directly in the code. The following describes what needs to be done to use an external C library in Cython code. Imagine you need an efficient way to store integer values in a FIFO queue. Since memory really matters, and the values are actually coming from C code, you cannot afford to create and store Python int objects in a list or deque. So you look out for a queue implementation in C.
After some web search, you find the C-algorithms library and decide to use its double ended queue implementation. To make the handling easier, however, you decide to wrap it in a Python extension type that can encapsulate all memory management. The C API of the queue implementation, which is defined in the header file libcalg/queue.h, essentially looks like this:
typedef struct _Queue Queue;
typedef void *QueueValue;
Queue *queue_new(void);
void queue_free(Queue *queue);
int queue_push_head(Queue *queue, QueueValue data);
QueueValue queue_pop_head(Queue *queue);
QueueValue queue_peek_head(Queue *queue);
int queue_push_tail(Queue *queue, QueueValue data);
QueueValue queue_pop_tail(Queue *queue);
QueueValue queue_peek_tail(Queue *queue);
int queue_is_empty(Queue *queue);
To get started, the first step is to redefine the C API
in a .pxd file, say, cqueue.pxd:
cdef extern from "libcalg/queue.h":
    ctypedef struct Queue:
        pass
    ctypedef void* QueueValue

    Queue* queue_new()
    void queue_free(Queue* queue)

    int queue_push_head(Queue* queue, QueueValue data)
    QueueValue queue_pop_head(Queue* queue)
    QueueValue queue_peek_head(Queue* queue)

    int queue_push_tail(Queue* queue, QueueValue data)
    QueueValue queue_pop_tail(Queue* queue)
    QueueValue queue_peek_tail(Queue* queue)

    bint queue_is_empty(Queue* queue)
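With the declarations in place, a wrapper extension type can own the queue and tie its lifetime to Python's memory management. A minimal sketch of how such a wrapper might begin (storing small integers directly in the void* values; error handling simplified):

# queue.pyx
cimport cqueue

cdef class Queue:
    cdef cqueue.Queue* _c_queue

    def __cinit__(self):
        self._c_queue = cqueue.queue_new()
        if self._c_queue is NULL:
            raise MemoryError()

    def __dealloc__(self):
        if self._c_queue is not NULL:
            cqueue.queue_free(self._c_queue)

    cpdef append(self, int value):
        # store the integer directly in the void* value slot
        if not cqueue.queue_push_tail(self._c_queue,
                                      <void*><Py_ssize_t>value):
            raise MemoryError()

    cpdef int pop(self) except? -1:
        if cqueue.queue_is_empty(self._c_queue):
            raise IndexError("Queue is empty")
        return <Py_ssize_t>cqueue.queue_pop_head(self._c_queue)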
6. Unicode and passing strings
Similar to the string semantics in Python 3, Cython also strictly separates byte strings and Unicode strings. Above all, this means that there is no automatic conversion between byte strings and Unicode strings (except for what Python 2 does in string operations). All encoding and decoding must pass through an explicit encoding/decoding step.
It is, however, very easy to pass byte strings between C code and Python. When receiving a byte string from a C library, you can let Cython convert it into a Python byte string by simply assigning it to a Python variable:
cdef char* c_string = c_call_returning_a_c_string()
py_string = c_string
This creates a Python byte string object that holds a copy of the original C string. It can be safely passed around in Python code, and will be garbage collected when the last reference to it goes out of scope. To convert the byte string back into a C char*, use the opposite assignment:
cdef char* other_c_string = py_string
This is a very fast operation after which other_c_string points to the byte string buffer of the Python string itself. It is tied to the lifetime of the Python string. When the Python string is garbage collected, the pointer becomes invalid. It is therefore important to keep a reference to the Python string as long as the char* is in use. Often enough, this only spans the call to a C function that receives the pointer as a parameter. Special care must be taken, however, when the C function stores the pointer for later use. Apart from keeping a Python reference to the string, no manual memory management is required. The above way of passing and receiving C strings is as simple as that, as long as we only deal with binary data in the strings. When we deal with encoded text, however, it is best practice to decode the C byte strings to Python Unicode strings on reception, and to encode Python Unicode strings to C byte strings on the way out.
With a Python byte string object, you would normally just call the .decode() method to decode it into a Unicode string:
ustring = byte_string.decode('UTF-8')
You can do the same in Cython for a C string, but the generated code is rather inefficient for small strings. While Cython could potentially call the Python C API function for decoding a C string from UTF-8 to Unicode (PyUnicode_DecodeUTF8()), the problem is that this requires passing the length of the C string, which Cython cannot know, neither at compile time nor at runtime. So it would have to call strlen() first, although the user code will already know the length of the string in almost all cases. Also, the encoded byte string might actually contain null bytes, so this isn't even a safe solution. It is therefore currently recommended to call the API functions directly:
# .pxd file that comes with Cython
cimport python_unicode

cdef char* c_string = NULL
cdef Py_ssize_t length = 0

# get pointer and length from a C function
get_a_c_string(&c_string, &length)

# decode the string to Unicode
ustring = python_unicode.PyUnicode_DecodeUTF8(
    c_string, length, 'strict')
7. Parallel computation
When discussing parallel computation there is an important distinction between shared-memory models and message-passing models. We start by discussing the shared-memory case. A common approach in parallel numerical code is to use OpenMP. While not difficult to support in principle, OpenMP is currently not available in Cython. Instead, Python threads must be used. This comes with some problems, but they can be worked around.
Threading is a problem with CPython because every operation involving a Python object must be done while holding the Global Interpreter Lock (GIL). The result is that pure Python scripts are typically unable to utilize more than one CPU core, even if many threads are used. It should be noted that for many computational scripts this does not matter. If the bulk of the computation happens in wrapped, native code (as in the case of NumPy or SciPy), then the GIL is typically released during the computation. For Cython the situation is worse: once inside Cython code, the GIL is by default held until one returns to the Python caller. Whereas a pure Python script will tend to switch between threads on a single CPU core, a Cython program will by default tend to not switch threads at all. The solution is to use Cython language constructs to manually release the GIL. One can then achieve proper multi-threading on many cores. The catch is that no Python operations are allowed when the GIL is released; for instance, all variables used must be typed with a C type. Optimized NumPy array lookups are allowed. The Cython compiler will help enforce these rules. Example:
cimport cython
cimport numpy as np

@cython.boundscheck(False)
def find_first(np.ndarray[np.int64_t] haystack,
               np.int64_t needle):
    cdef Py_ssize_t i, ret = -1
    with nogil:
        for i from 0 <= i < haystack.shape[0]:
            if haystack[i] == needle:
                ret = i
                break
    return ret
Without nogil, invocations from separate threads would be serialized. Returning the result is a Python operation, so that has to be put outside of the nogil block. Furthermore, bounds checking must be turned off, as raising an IndexError would require the GIL. The Cython documentation contains further information on the various primitives for managing the GIL (search for "nogil").

The message-passing case is much simpler. Several Python interpreters are launched, each in its own process, so that the GIL is not an issue. A popular approach is mpi4py together with an MPI implementation (such as OpenMPI). mpi4py very conveniently allows passing full Python objects between computational nodes through Python pickling, and it is also possible to efficiently pass NumPy arrays. mpi4py is itself written in Cython, and ships with the Cython compile-time definitions necessary to communicate directly with the underlying MPI C API. It is thus possible to use both interchangeably:
from mpi4py import MPI
from mpi4py cimport MPI
from mpi4py cimport mpi_c

cdef MPI.Comm comm = MPI.COMM_WORLD

if comm.Get_rank() == 0:
    # High-level send of a Python object
    comm.send({'a': any_python_object, 'b': other},
              to=1)
    for <...a lot of small C-typed messages...>:
        # Fast, low-level send of typed C data
        mpi_c.MPI_Send(<...>, comm.ob_mpi)
elif ...
This is useful e.g. in situations where one wants to pass typed Cython variables and does not want to bother with conversion back and forth to Python objects. This also avoids the overhead of the Python call to mpi4py (although in practice there is likely to be other, larger bottlenecks). While contrived, this example should demonstrate some of the flexibility of Cython with regards to native libraries. mpi4py can be used for sending higher-level data or a few big messages, while the C API can be used for sending C data or many small messages.
The same principle applies to multi-threaded code: It is possible, with some care, to start the threads through the Python API and then switch to e.g. native OS thread mutexes where any Python overhead would become too large. The resulting code would however be platform-specific as Windows and Unix-based systems have separate threading APIs.
Some typical use cases
Cython has been successfully used in a wide variety of situations, from the half a million lines of Cython code in Sage (http://www.sagemath.org), to providing Python-friendly wrappers to C libraries, to small personal projects. Here are some example use cases demonstrating where Cython has proved valuable.
Sparse matrices
SciPy and other libraries provide the basic high-level operations for working with sparse matrices. However, constructing sparse matrices often follows complicated rules for which elements are nonzero. Such code can seldom be expressed in terms of NumPy expressions; the most naive method would need temporary arrays of the same size as the corresponding dense matrix, thus defeating the purpose. Cython is ideal for this, as the element-by-element construction loops can be written easily and run quickly. In such cases, one might also want to explore calling existing C or Fortran code directly from Cython. Some libraries have ready-made Cython wrappers; for instance, Sage has Cython wrappers around the ordinary differential equation solvers in the GNU Scientific Library. In some cases, one might opt for implementing the algorithm directly in Cython, to avoid any callback whatsoever; using Newton's method on equations of a single variable comes to mind.
Non-rectangular arrays and data repacking
Sometimes data does not fit naturally in rectangular arrays, and Cython is especially well-suited to this situation. One such example arises in cosmology. Satellite experiments such as the Wilkinson Microwave Anisotropy Probe have produced high-resolution images of the Cosmic Microwave Background, a primary source of information about the early universe. The resulting images are spherical, as they contain values for all directions on the sky. The spherical harmonic transform of these maps, a "Fourier transform on the sphere", is especially important. It has complex coefficients aℓm, where the indices run over 0 ≤ ℓ ≤ ℓmax, −ℓ ≤ m ≤ ℓ.

An average of the entire map is stored in a0,0, followed by three elements to describe the dipole component, a1,−1, a1,0, a1,1, and so on. Data like this can be stored in a one-dimensional array, with the element aℓm looked up at position ℓ² + ℓ + m. It is possible, but not trivial, to operate on such data using NumPy whole-array operations. The problem is that NumPy functions, such as finding the variance, are primarily geared towards rectangular arrays. If the data were rectangular, one could estimate the variance per ℓ, averaging over m, by calling np.var(data, axis=1). This doesn't work for non-rectangular data. While there are workarounds, such as the reduceat method and masked arrays, we have found it much more straightforward to write the obvious loops over ℓ and m using Cython. For comparison, with Python and NumPy one could loop over ℓ and repeatedly call np.var for sub-slices of the data, which was 27 times slower in our case (ℓmax = 1500). Using a naive double loop over both ℓ and m was more than 1000 times slower in Python than in Cython. (Incidentally, the variance per ℓ, or power spectrum, is the primary quantity of interest to observational cosmologists.)

The spherical harmonic transform mentioned above is computed using the Fortran library HEALPix, which can readily be called from Cython with the help of Fwrap. However, HEALPix returns the result as a 2D array, with roughly half of the elements unoccupied. The waste of storage aside, 2D arrays are often inconvenient: with 1D arrays one can treat each set of coefficients as a vector, and perform linear algebra, estimate covariance matrices and so on, in the usual way. Again, it is possible to quickly reorder the data the way we want it with a Cython loop. With all the existing code out there wanting data in slightly different orders and formats, for loops are not about to disappear.
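As an illustration, here is a minimal sketch of the kind of per-ℓ variance loop described above (the function name and the zero-mean assumption are ours, not from the original benchmarks):

import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def power_spectrum(np.ndarray[np.complex128_t] alm, int lmax):
    # coefficients are stored 1D, with a_{lm} at position l*l + l + m
    cdef np.ndarray[np.float64_t] C = np.zeros(lmax + 1)
    cdef int l, m
    cdef double complex v
    for l in range(lmax + 1):
        for m in range(-l, l + 1):
            v = alm[l*l + l + m]
            C[l] += v.real * v.real + v.imag * v.imag
        C[l] /= 2 * l + 1
    return C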
Fwrap
Whereas C and C++ integrate closely with Cython, Fortran wrappers in Cython are generated with Fwrap, a separate utility distributed independently of Cython. Fwrap is a tool that automates wrapping Fortran source in C, Cython and Python, allowing Fortran code to benefit from the dynamism and flexibility of Python. Fwrapped code can be seamlessly integrated into a C, Cython or Python project. The utility transparently supports most of the features introduced in Fortran 90/95/2003, and will handle nearly all Fortran 77 source as well. Fwrap does not currently support derived types or function callbacks, but support for these features is scheduled in an upcoming release. Thanks to the C interoperability features supplied
in the Fortran 2003 standard – and supported in recent versions of all widely-used Fortran 90/95
compilers – Fwrap generates wrappers that are portable across platforms and compilers. Fwrap is intended to be as friendly as possible, and handles the Fortran parsing and generation automatically. It also generates a build script for the project that will portably build a Python extension module from the wrapper files.
DATA TYPES IN CYTHON
Cython is a Python compiler. This means that it can compile normal Python code without changes (with a few obvious exceptions of some as-yet unsupported language features). However, for performance-critical code, it is often helpful to add static type declarations, as they will allow Cython to step out of the dynamic nature of the Python code and generate simpler and faster C code - sometimes faster by orders of magnitude.
It must be noted, however, that type declarations can make the source code more verbose and thus less readable. It is therefore discouraged to use them without good reason, such as where benchmarks prove that they really make the code substantially faster in a performance-critical section. Typically a few types in the right spots go a long way. Cython can produce annotated output (see figure 2 below) that can be very useful in determining where to add types.

All C types are available for type declarations: integer and floating point types, complex numbers, structs, unions and pointer types. Cython can automatically and correctly convert between the types on assignment. This also includes Python's arbitrary size integer types, where value overflows on conversion to a C type will raise a Python OverflowError at runtime. The generated C code will handle the platform-dependent sizes of C types correctly and safely in this case.
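A small illustrative sketch of these conversions (the function and variable names are ours):

def conversions(pyval):
    cdef int n = 42
    cdef double x = n    # C-level conversion from int to double
    cdef char c = pyval  # converting a Python integer; raises
                         # OverflowError at runtime if it does not fit
    obj = x              # automatic conversion back to a Python float
    return obj, c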
Figure 2: Using the -a switch to the cython command-line program (or following a link from the Sage notebook) results in an HTML report of Cython code interleaved with the generated C code. Lines are colored according to the level of "typedness"; white lines translate to pure C without any Python API calls. This report is invaluable when optimizing a function for speed.
cdef functions
Python function calls can be expensive, and this is especially true in Cython because one might need to convert to and from Python objects to do the call. In our example above, the argument is assumed to be a C double both inside f() and in the call to it, yet a Python float object must be constructed around the argument in order to pass it. Therefore Cython provides a syntax for declaring a C-style function, the cdef keyword:
cdef double f(double x) except *:
    return sin(x ** 2)
Some form of except-modifier should usually be added, otherwise Cython will not be able to propagate exceptions raised in the function (or in a function it calls). Above, except * is used, which is always safe. An except clause can be left out if the function returns a Python object or if it is guaranteed that an exception will not be raised within the function call.
A side-effect of cdef is that the function is no longer available from Python-space, as Python wouldn’t know how to call it. Using the cpdef keyword instead of cdef, a Python wrapper is also created, so that the function is available both from Cython (fast, passing typed values directly) and from Python (wrapping values in Python objects). Note also that it is no longer possible to change f at runtime. Speedup: 45 times over pure Python.
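For instance, a minimal sketch of the cpdef variant of the function above:

cpdef double f(double x) except *:
    # callable from other Cython code at C speed, and from Python
    # through an automatically generated wrapper
    return sin(x ** 2)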
pxd files
In addition to the .pyx source files, Cython uses .pxd files which work like C header files - they contain Cython declarations (and sometimes code sections) which are only meant for sharing C-level declarations with other Cython modules. A pxd file is imported
into a pyx module by using the cimport keyword. pxd files have many use cases:
1. They can be used for sharing external C declarations.
2. They can contain functions which are well suited for inlining by the C compiler. Such functions should be marked inline, for example:
cdef inline int int_min(int a, int b):
    return b if b < a else a
3. When accompanying an equally named pyx file, they provide a Cython interface to the Cython module so that other Cython modules can communicate with it using a more efficient protocol than the Python one.
In our integration example, we might break it up into pxd files like this:
1. Add a cmath.pxd file which declares the C functions available from the C math.h header file, like sin. Then one would simply do from cmath cimport sin in integrate.pyx.
2. Add an integrate.pxd so that other modules written in Cython can define fast custom functions to integrate:
cdef class Function:
    cpdef evaluate(self, double x)

cpdef integrate(Function f, double a,
                double b, int N)
Note that if you have a cdef class with attributes, the attributes must be declared in the class declaration pxd file (if you use one), not the pyx file. The compiler will tell you about this.
Using Cython with NumPy
Cython has support for fast access to NumPy arrays. To optimize code using such arrays one must cimport the NumPy pxd file (which ships with Cython), and declare any arrays as having the ndarray type. The data type and number of dimensions should be fixed at compile-time and passed. For instance:
import numpy as np
cimport numpy as np

def myfunc(np.ndarray[np.float64_t, ndim=2] A):
    <...>
myfunc can now only be passed two-dimensional arrays containing double-precision floats, but the array indexing operations are much, much faster, making it suitable for numerical loops. Expect speed increases well over 100 times over a pure Python loop; in some cases the speed increase can be as high as 700 times or more.
Fast array declarations can currently only be used with function-local variables and arguments to def-style functions (not with arguments to cpdef or cdef functions, and neither with fields in cdef classes or as global variables). These limitations are considered known defects and we hope to remove them eventually. In most circumstances it is possible to work around these limitations rather easily and without a significant speed penalty, as all NumPy arrays can also be passed as untyped objects.

Array indexing is only optimized if exactly as many indices are provided as the number of array dimensions. Furthermore, all indices must have a native integer type. Slices and NumPy "fancy indexing" are not optimized.
Installing Cython
Many scientific Python distributions, such as the Enthought Python Distribution [EPD], Python(x,y) [Pythonxy], and Sage [Sage], bundle Cython, and no setup is needed. Note however that if your distribution ships a version of Cython which is too old, you can still use the instructions below to update Cython. Everything in this tutorial should work with Cython 0.11.2 and newer, unless a footnote says otherwise. Unlike most Python software, Cython requires a C compiler to be present on the system. The details of getting a C compiler vary according to the system used:
• Linux The GNU C Compiler (gcc) is usually present, or easily available through the
package system. On Ubuntu or Debian, for instance, the command sudo apt-get install
build-essential will fetch everything you need.
• Mac OS X To retrieve gcc, one option is to install Apple's XCode, which can be retrieved from the Mac OS X install DVDs or from http://developer.apple.com.
• Windows A popular option is to use the open source MinGW (a Windows distribution of gcc). See the appendix for instructions for setting up MinGW manually. EPD and Python(x,y) bundle MinGW, but some of the configuration steps in the appendix might still be necessary. Another option is to use Microsoft's Visual C. One must then use the same version which the installed Python was compiled with.

The newest Cython release can always be downloaded from http://cython.org. Unpack the tarball or zip file, enter the directory, and then run:

python setup.py install
If you have Python setuptools set up on your system, you should be able to fetch Cython from PyPI and install it using:
easy_install cython
For Windows there is also an executable installer available for download.
Building Cython code
Cython code must, unlike Python, be compiled. This happens in two stages:
• A .pyx file is compiled by Cython to a .c file, containing the code of a Python extension module.
• The .c file is compiled by a C compiler to a .so file (or .pyd on Windows) which can be imported directly into a Python session.

There are several ways to build Cython code:
• Write a distutils setup.py.
• Use pyximport, importing Cython .pyx files as if they were .py files (using distutils to compile and build in the background).
• Run the cython command-line utility manually to produce the .c file from the .pyx file, then manually compile the .c file into a shared object library or .dll suitable for import from Python. (This is mostly for debugging and experimentation.)
• Use the [Sage] notebook, which allows Cython code inline and makes it easy to experiment with Cython code without worrying about compilation details (see figure 1 below).

Currently, distutils is the most common way Cython files are built and distributed.
Building a Cython module using distutils
Imagine a simple “hello world” script in a file hello.pyx:
def say_hello_to(name):
    print("Hello %s!" % name)
The following could be a corresponding setup.py script:
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

ext_modules = [Extension("hello", ["hello.pyx"])]

setup(
    name='Hello world app',
    cmdclass={'build_ext': build_ext},
    ext_modules=ext_modules
)
To build, run python setup.py build_ext --inplace. Then simply start a Python session and do from hello import say_hello_to and use the imported function as you see fit.
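A session might then look like this (illustrative output):

$ python setup.py build_ext --inplace
$ python
>>> from hello import say_hello_to
>>> say_hello_to("World")
Hello World!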
Figure 1: The Sage notebook allows transparently editing and compiling Cython code simply by typing %cython at the top of a cell and evaluating it. Variables and functions defined in a Cython cell are imported into the running session.
ADVANTAGES
The special advantage of this seamless Python/C intermix approach is that existing Python code can be tuned to almost the speed of C by just adding a few static type declarations and by making some adaptations in critical loops, without requiring complicated interface code. The coding speed and the readability of the code remain very Pythonic. Due to reduced overhead in control structures (especially loops), optimistic optimisations and (limited) type inference, Cython-compiled Python code usually executes faster than in the CPython 2.6.x interpreter, although the absolute improvements largely depend on the code. With type declarations, the typical speed-up for numerical/array computations is about 100x-1000x. In comparison, the typical speed-up with Psyco (a Python JIT compiler) is about 4x-100x.
Cython has been optimised for low call overhead, so a Python call into external C/C++ code through a Cython wrapper is usually faster than with most other wrapping solutions.
LIMITATIONS
When compared to writing code in pure Python, Cython's primary disadvantages are compilation time and the need to have a separate build phase. Most projects using Cython are therefore written in a mix of Python and Cython, as Cython sources don't need to be recompiled when Python sources change. Cython can still be used to compile some of the Python modules for performance reasons. There is also an experimental "pure" mode where decorators are used to indicate static type declarations, which are valid Python and ignored by the interpreter at runtime, but are used by Cython when compiled. This combines the advantage of a fast edit-run cycle with the high runtime performance of the final product. There is also the question of code distribution. Many projects, rather than requiring Cython as a dependency, ship the generated .c files, which compile against Python 2.3 to 3.2 without any modifications, as part of the distutils setup phase.
Compared to compiled languages such as Fortran and C, Cython's primary limitation is the limited support for shared-memory parallelism. Python is inherently limited in its multithreading capabilities, due to the use of a Global Interpreter Lock (GIL). Cython code can declare sections as only containing C code (using a nogil directive), which are then able to run in parallel. However, this can quickly become tedious. Currently there is also no support for OpenMP programming in Cython. On the other hand, message-passing parallelism using multiple processes, for instance through MPI, is very well supported.
Compared to C++, a major weakness is the lack of built-in template support, which aids in writing code that works efficiently with many different data types. In Cython, one must either repeat code for each data type or use an external templating system, in the same way that is often done for Fortran codes. Many template engines exist for Python, and most of them should work well for generating Cython code.
Using a language which can be either dynamic or static takes some experience. Cython is clearly useful when talking to external libraries, but when is it worth it to replace normal Python code with Cython code? The obvious factor to consider is the purpose of the code – is it a single experiment, for which the Cython compilation time might overshadow the pure Python run time? Or is it a core library function, where every ounce of speed matters?
It is possible to paint some broad strokes when it comes to the type of computation considered. Is the bulk of time spent doing low-level number crunching in your code, or is the heavy lifting done through calls to external libraries? How easy is it to express the computation in terms of NumPy operations? For sequential algorithms such as equation solving and statistical simulations it is indeed impossible to do without a loop of some kind. Pure Python loops can be very slow, but the impact of this still varies depending on the use case.
CONCLUSIONS AND FUTURE WORK
While the examples shown have been simple and in part contrived, they explore fundamental properties of loops and arrays that should apply in almost any real-world computation. For computations that can only be expressed as for-loops, and which are not available in a standard library, Cython should be a strong candidate. Certainly, for anything but very small amounts of data, a Python loop is unviable. The choice stands between Cython and other natively compiled technologies. Cython may not automatically produce quite as optimized code as e.g. Fortran, but we believe it is fast enough to still be attractive in many cases because of the high similarity with Python. With Cython and NumPy, copying of non-contiguous arrays is always explicit, which can be a huge advantage compared to some other technologies (like Fortran) when dealing with very large data sets.
For the algorithms which are expressible as NumPy operations, the speedup is much lower, ranging from no speedup to around ten times. The Cython code is usually much more verbose and requires more decisions to be made at compile-time. Use of Cython in these situations seems much less clear cut. A good approach is to prototype using pure Python, and, if it is deemed too slow, optimize the important parts after benchmarks or code profiling.
Cython remains in active development. Because of the simple principles involved, new features are often easy to add, and are often the result of personal itch scratching.
Sometimes the experience has been that it is quicker to add a feature to Cython than to repeatedly write code to work around an issue. Some
highlights of current development:
• Support for function-by-function profiling through the Python cProfile module was added in 0.11.3.
• Inner functions (closures) are maturing and will be released soon.
• Cython benefited from two Google Summer of Code [GSoC] projects over the summer of 2009, which will result in better support for calling C++ and Fortran code.
One important and often requested feature for numerical users is template support. This would make it possible to make a single function support all array data types, without code duplication. Other possible features are improved parallel programming support, like OpenMP primitives. While no work is currently going on in these areas, the Cython developers remain conscious of these shortcomings. One important future feature for numerical users is the new memory view type. K. W. Smith and D. S. Seljebotn started work on this in the summer of 2009 as part of Smith's Google Summer of Code.
The new Python buffer protocol based on PEP 3118 promises a shift in focus for array data. In Python 2.6 and greater, any Python object can export array-like data to natively compiled code (like Cython code) in an efficient and standardized manner. In some respects this represents adoption of some NumPy functionality into the Python core. With this trend, it seems reasonable that Cython should provide good mechanisms for working with PEP 3118 buffers independently of NumPy. Incidentally, this will also provide a nice unified interface for interacting with C and Fortran arrays in various formats. Unlike NumPy, PEP 3118 buffers also support pointer-indirection-style arrays, sometimes used in C libraries.
With this new feature, the matrix multiplication routine could have been declared as:
def matmul(double[:,:] A,
           double[:,:] B,
           double[:,:] out):
    <...>
Using this syntax, a buffer is acquired from the arguments on entry. The interface of the argument variables is entirely decided by Cython, and it is not possible to use Python object operations. NumPy array methods like A.mean() will therefore no longer work. Instead, one will have to call np.mean(A) (which will work once NumPy supports PEP 3118). The advantage is that when Cython defines the interface, further optimizations can be introduced. Slices and arithmetic operations are not currently subject to optimization because of polymorphism. For instance, it would currently be impossible for Cython to optimize the multiplication operator, as it means different things for ndarray and its subclass matrix.