Good evening, nice to meet you. (゜∀゜)o彡° Pyston! Pyston!!
Pyston is a Python 2.7-compatible implementation developed by Dropbox. It aims to speed up Python with LLVM-based JIT compilation, and its main rival is PyPy. At the moment it only supports x86_64, and Ubuntu 14 is recommended if you want to try it.
See here for details on how to build it: http://qiita.com/Masahito/items/edd028ebc17c9e6b22b0
Pyston released v0.2 in September 2014 and now seems to be working on v0.3.
In 0.3, the focus is on improving performance on real benchmarks.
The Pyston repository contains representative benchmarks. Once it was built, I was able to run a benchmark with make run_TESTNAME.
minibenchmarks
allgroup.py fannkuch.py go.py interp2.py nbody_med.py raytrace.py
chaos.py fannkuch_med.py interp.py nbody.py nq.py spectral_norm.py
microbenchmarks
attribute_lookup.py fib2.py listcomp_bench.py repatching.py vecf_add.py
attrs.py function_calls.py nested.py simple_sum.py vecf_dot.py
closures.py gcj_2014_2_b.py polymorphism.py sort.py
empty_loop.py gcj_2014_3_b.py prime_summing.cpp thread_contention.py
fib.py iteration.py prime_summing.py thread_uncontended.py
fib.pyc lcg.py pydigits.py unwinding.py
I investigated the features of the JIT compiler while looking at the Pyston README.
https://github.com/dropbox/pyston
I think the highlight of Pyston is JIT compilation using LLVM. Among systems that use LLVM as a JIT compiler, the most famous is probably JavaScriptCore's FTL JIT.
This explanation of the FTL JIT is a helpful reference: http://blog.llvm.org/2014/07/ftl-webkits-llvm-based-jit.html
JSC performs four-tier JIT compilation, and Pyston likewise compiles in four tiers. Where JSC uses JSBytecode, Pyston works on an AST, which from Pyston 0.3 appears to come from the pypa parser.
JIT compilation with LLVM happens in tiers 2 through 4. Tier 2 embeds code that collects type information at runtime, without applying LLVM's optimizations. Tier 4 performs type speculation based on the types collected at runtime and applies LLVM's optimizations to generate fast code.
Tier 4 targets loops that execute 10,000 or more times and functions that are called 10,000 or more times.
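As a rough illustration, the 10,000-execution heuristic could be expressed with a simple counter like this (a minimal sketch, not Pyston's actual implementation):
:lang:cpp
#include <cstdint>

// Minimal sketch of the 10,000-execution heuristic described above
// (not Pyston's actual code): a per-function / per-loop counter
// triggers recompilation at the highest effort level once hot.
struct TierUpCounter {
    static const uint64_t kThreshold = 10000;
    uint64_t count = 0;

    // Called on every function entry or loop back-edge; returns true
    // once the code is hot enough for the tier-4 recompile.
    bool shouldTierUp() { return ++count >= kThreshold; }
};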
In the future, the third tier is planned to be removed, and they seem to want to replace the LLVM-IR interpreter in the first tier with their own implementation.
Patchpoints seem to use LLVM's intrinsics; are the stackmaps an original implementation?
Inlining
Python methods seem to be inlined opportunistically at JIT compile time.
In addition, the basic operations (boxing/unboxing) and collections (list/dict/tuple/xrange) that the runtime needs frequently are compiled to LLVM bitcode when Pyston itself is built, and inlined at the bitcode level at JIT compile time.
This part is somewhat distinctive: the inliner appears to be Pyston's own pass, created by modifying LLVM's.
See codegen/opt/inliner and runtime/inline (the latter is the collection that gets compiled to bitcode).
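To see why bitcode-level inlining helps, consider a runtime helper along the lines of the hypothetical one below (the names intAdd/boxInt are assumptions, not Pyston's actual functions): once its body is available as bitcode at JIT time, LLVM can inline it into generated code, and later passes can fold away the box/unbox pair around the raw addition.
:lang:cpp
#include <cstdint>

struct BoxedClass;
struct Box { BoxedClass* cls; };
struct BoxedInt : Box { int64_t n; };

Box* boxInt(int64_t n);  // runtime helper that allocates a BoxedInt

// Hypothetical stand-in for the kind of runtime routine compiled to
// bitcode when Pyston itself is built: once its body is visible at
// JIT time, LLVM can inline it, and later passes can often eliminate
// the unbox (lhs->n, rhs->n) and the re-box around the raw addition.
Box* intAdd(BoxedInt* lhs, BoxedInt* rhs) {
    return boxInt(lhs->n + rhs->n);
}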
inline cache
Inline caches are managed through RuntimeIC(void* addr, int num_slots, int slot_size), and the bookkeeping itself is ICInfo.
Calls go through a call() method on a class that inherits from RuntimeIC, and the call itself is implemented as templates:
:lang:src/runtime/ics.cpp
template <class... Args> uint64_t call_int(Args... args) {
return reinterpret_cast<uint64_t (*)(Args...)>(this->addr)(args...);
}
template <class... Args> bool call_bool(Args... args) {
return reinterpret_cast<bool (*)(Args...)>(this->addr)(args...);
}
template <class... Args> void* call_ptr(Args... args) {
return reinterpret_cast<void* (*)(Args...)>(this->addr)(args...);
}
template <class... Args> double call_double(Args... args) {
return reinterpret_cast<double (*)(Args...)>(this->addr)(args...);
}
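Building on the quoted interface, a concrete IC would presumably wrap one of these templates in a typed call() method, along these lines (a hedged sketch: the class name, slow-path function, and slot parameters are illustrative, not Pyston's actual values):
:lang:cpp
// Hedged sketch on top of the RuntimeIC interface quoted above; the
// class name, slow-path function, and slot parameters are
// illustrative assumptions (see src/runtime/ics for the real classes).
struct BoxedClass;
struct Box { BoxedClass* cls; };

Box* slowpathBinop(Box* lhs, Box* rhs);  // hypothetical initial IC target

class BinopIC : public RuntimeIC {
public:
    BinopIC() : RuntimeIC((void*)slowpathBinop, /*num_slots=*/2,
                          /*slot_size=*/160) {}

    Box* call(Box* lhs, Box* rhs) {
        // Dispatch through the patchable IC entry point.
        return (Box*)call_ptr(lhs, rhs);
    }
};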
For details, see src/runtime/ics and src/asm_writing/icinfo.
hidden class
Is this equivalent to V8's hidden classes?
For some reason it inherits from ConservativeGCObject, and many of its methods are GC-related; that part is a mystery.
Does Python need hidden classes to absorb the differences in attributes between instances? I am not sure about the Python semantics here: are differing attribute sets actually a problem?
For details, see src/runtime/objmodel.
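For context on why this matters in Python: instances of the same class can each acquire a different set of attributes at runtime, so a plain implementation makes every attribute access a dict lookup. A hidden class maps each observed attribute layout to fixed slot offsets, roughly as in this V8-style sketch (not Pyston's actual implementation):
:lang:cpp
#include <string>
#include <unordered_map>

// Minimal V8-style hidden-class sketch (not Pyston's actual code):
// objects that add attributes in the same order share a HiddenClass,
// so attribute access becomes a shape check plus an indexed load
// instead of a per-object dict lookup.
struct HiddenClass {
    std::unordered_map<std::string, int> attrOffsets;          // name -> slot index
    std::unordered_map<std::string, HiddenClass*> transitions; // name -> next shape

    // Returns the shape an object moves to when it gains 'name';
    // reusing transitions is what lets objects share shapes.
    HiddenClass* addAttribute(const std::string& name) {
        auto it = transitions.find(name);
        if (it != transitions.end())
            return it->second;           // existing shared shape
        int slot = (int)attrOffsets.size();
        HiddenClass* child = new HiddenClass(*this);
        child->transitions.clear();
        child->attrOffsets[name] = slot;
        transitions[name] = child;
        return child;
    }
};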
Type feedback
When compiling at tiers 2 and 3, the JIT embeds code that collects runtime type information.
Basically, the tier-2/3 JIT emits asm that reads an object's cls field (its BoxedClass) at runtime and records it in a recorder. A class recorded 100 or more times by the time of the next JIT compilation seems to be adopted as the type prediction.
The prediction becomes a CompileType, and at JIT compile time the code is specialized from the dynamic type to that CompileType; at that point it actively tries to unbox from the boxed class to the CompileType.
Speculation is in src/analysis/type_analysis, the recorder is in src/codegen/type_recording, and the runtime-side recording is in src/runtime/objmodel.
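A minimal sketch of this record-then-predict scheme, assuming the 100-observation threshold mentioned above (this is not Pyston's actual code; see src/codegen/type_recording for the real recorder):
:lang:cpp
#include <unordered_map>

struct BoxedClass;                 // runtime class object
struct Box { BoxedClass* cls; };   // every object starts with its class

// Minimal sketch of the type-feedback scheme described above (not
// Pyston's actual code): tier-2/3 code records the cls of the values
// it sees, and at the next JIT compilation a class is adopted as the
// prediction only if it was observed 100 or more times.
struct TypeRecorder {
    std::unordered_map<BoxedClass*, int> counts;

    void record(Box* obj) { ++counts[obj->cls]; }   // called from emitted asm

    BoxedClass* predict() const {
        for (const auto& entry : counts)
            if (entry.second >= 100)
                return entry.first;   // stable enough: speculate on it
        return nullptr;               // no dominant type observed
    }
};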
Object representation
All instances handled by Pyston seem to be boxed. Every object therefore starts with a cls field; BoxedInt, for example, stores its value as an int64_t.
The list of boxed classes probably looks like this:
:lang:
BoxedClass* object_cls, *type_cls, *bool_cls, *int_cls, *long_cls, *float_cls, *str_cls, *function_
*none_cls, *instancemethod_cls, *list_cls, *slice_cls, *module_cls, *dict_cls, *tuple_cls, *file_cls,
*member_cls, *method_cls, *closure_cls, *generator_cls, *complex_cls, *basestring_cls, *unicode_cls, *
*staticmethod_cls, *classmethod_cls;
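To make the layout concrete, here is a minimal sketch of the boxed representation described above (field names are assumptions; see src/runtime/classobj for the real definitions):
:lang:cpp
#include <cstdint>

struct BoxedClass;

// Minimal sketch of the boxed layout described above (field names are
// assumptions): every instance begins with its cls pointer, and
// BoxedInt carries the integer payload as an int64_t right after it.
struct Box {
    BoxedClass* cls;   // filled in at the start of every object
};

struct BoxedInt : Box {
    int64_t n;         // the raw integer value
};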
Only boxed objects can be stored in the various collections (list, dict) and in args, and the code that extracts the actual values from them is specialized as much as possible.
So, based on the type-feedback results, the runtime types of a function's args are inferred, boxed objects are eliminated as much as possible, and unboxing is inserted where appropriate.
For type speculation, see src/analysis; for Box and its derived classes, see src/runtime/classobj.
Optimize
Fast code is generated by applying LLVM's optimizations:
:lang:src/codegen/irgen.cpp
doCompile()
  CompiledFunction()
  if (ENABLE_SPECULATION && effort >= EffortLevel::MODERATE)
    doTypeAnalysis()
      BasicBlockTypePropagator::propagate()
  optimizeIR()                 /* sets up the LLVM optimizations via the LLVM PassManager */
    makeFPInliner()            /* Pyston's own inlining pass */
    EscapeAnalysis()           /* escape analysis, Pyston's own implementation */
    createPystonAAPass()       /* updates the AA results using the escape-analysis results */
    createMallocsNonNullPass() /* seems to remove (malloced != NULL) checks */
    createConstClassesPass()
    createDeadAllocsPass()     /* removes allocations that do not escape */
The main control flow is in src/codegen/irgen.cpp, the speculation machinery is under src/analysis, and Pyston's own LLVM optimization passes are under src/codegen/opt.
I had wondered whether EscapeAnalysis would replace heap allocations with stack allocations, but it seems to only report references to non-escaping allocations as NoModRef to LLVM's ModRef analysis.
Does LLVM's ScalarReplAggregates then look at the NoModRef results and replace the allocations with allocas?
DeadAllocsPass analyzes load/store references using the AA results and removes unnecessary loads; after that, LLVM's DCE can remove the alloca-equivalent instructions.
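As an illustration of what escape analysis plus this kind of dead-alloc cleanup can achieve together, here is a hypothetical (non-Pyston) example of an allocation that never escapes:
:lang:cpp
// Hypothetical example of a dead allocation: 'tmp' is written and
// read only locally and never escapes, so the load can be forwarded
// from the store and the allocation removed entirely, leaving the
// equivalent of 'return value + 1;'.
long addOne(long value) {
    long* tmp = new long;   // allocation that never escapes
    *tmp = value + 1;       // store
    long result = *tmp;     // load, forwardable from the store above
    delete tmp;
    return result;
}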
According to this blog post, local variables are placed on the stack: http://blog.pyston.org/2014/11/06/frame-introspection-in-pyston/
C API native extension
Since v0.2, Pyston seems to support C-API extensions. Sample code is in test/test_extension; see src/capi.
:lang:test/basic_test.c
#include <Python.h>
#include <assert.h>

/* The object saved by store() and returned by load(). */
static PyObject *stored = NULL;

/* test_store is referenced by the method table below but was missing
 * from the original excerpt; this is a plausible reconstruction. */
static PyObject *
test_store(PyObject *self, PyObject *args)
{
    if (!PyArg_ParseTuple(args, "O", &stored))
        return NULL;
    Py_INCREF(stored);
    Py_RETURN_NONE;
}

static PyObject *
test_load(PyObject *self, PyObject *args)
{
    if (!PyArg_ParseTuple(args, ""))
        return NULL;
    assert(stored);
    Py_INCREF(stored);
    return stored;
}

static PyMethodDef TestMethods[] = {
    {"store", test_store, METH_VARARGS, "Store."},
    {"load", test_load, METH_VARARGS, "Load."},
    {NULL, NULL, 0, NULL} /* Sentinel */
};

PyMODINIT_FUNC
initbasic_test(void)
{
    PyObject *m;
    m = Py_InitModule("basic_test", TestMethods);
    if (m == NULL)
        return;
}
This material on IBM's Python JIT compiler is detailed and worth referring to; it investigates which parts of Python are slow.
http://www.cl.cam.ac.uk/research/srg/netos/vee_2012/slides/vee18-ishizaki-presentation.pdf
My investigation has not gotten very far, so this is just a memo. It would be great to get a feel for how Pyston compares with the other Python implementations and the JavaScript JIT engines (V8, FTL JIT, etc.) that it seems to be drawing on.
Pyston apparently plans to incorporate PyPy's benchmarks and compare itself against PyPy and CPython. At the moment PyPy is far faster than Pyston, and its underlying approach is fundamentally different. I have high expectations for Pyston going forward.