I am developing a gem called virtual_module that lets you call Python and Julia packages from Ruby. In the example below, Ruby reads the man pages of several commands as documents, and the doc2vec processing is delegated to Python's gensim.
doc2vec.rb
require 'natto'
manpages={}
natto = Natto::MeCab.new
%w"ps ls cat cd top df du touch mkdir".each do |cmd|
  list = []
  natto.parse(`man #{cmd} | col -bx | cat`) do |n|
    list << n.surface
  end
  manpages[cmd] = list
end
require 'virtual_module'
py = VirtualModule.new(:methods=><<EOS, :python=>["gensim"])
class LabeledListSentence(object):
    def __init__(self, words_list, label_list):
        self.words_list = words_list
        self.label_list = label_list
    def __iter__(self):
        for i, words in enumerate(self.words_list):
            yield gensim.models.doc2vec.LabeledSentence(words, [self.label_list[i]])
EOS
model = py.gensim.models.doc2vec.Doc2Vec(py.LabeledListSentence(manpages.values, manpages.keys), min_count:0)
p model.docvecs.most_similar(["ps"]) # [["top", 0.5594387054443359], ["cat", 0.46929454803466797], ["df", 0.3900265693664551], ["mkdir", 0.38811227679252625], ["du", 0.23663029074668884], ["ls", 0.15436093509197235], ["cd", -0.1965409815311432], ["touch", -0.38958919048309326]]
I used this to add a doc2vec-based related-articles feature, driven from Ruby, to my blog (built with Sinatra), and it turned out to be fairly convenient. I don't know how many people besides me will find it useful, but scikit-learn also works (albeit with quite a few rough edges), and I think it can be interesting depending on how you use it, so I'll write down the intended usage here.
In addition to doc2vec, I have summarized examples of using scikit-learn on my personal blog, so please check it out if you are interested.
Here I'll use a REPL to walk through how VirtualModule works internally. The following are assumed to be installed on your system:
- virtual_module gem (v0.3.0 or higher)
- Python runtime
First, launch irb.
debussy:~ remore$ irb -r virtual_module
irb(main):001:0> po = VirtualModule.new(:python=>["sklearn"=>"datasets"])
=> #<Module:0x007fb7e1aee818>
Calling `VirtualModule.new` launches a Python (or Julia) process in the background. Once the background process has started successfully, `VirtualModule.new` returns a new Module instance. From then on, we communicate with the background process through this Module instance (that is, the instance behaves like a proxy). For convenience, we'll call it a proxy object.
irb(main):002:0> po.int(2.3)
=> 2
irb(main):003:0> po.unknown_method(2.3)
RuntimeError: An error occurred while executing the command in python process: ,name 'unknown_method' is not defined
The behavior of a proxy object is very simple. In the example above, the proxy object receives the method call `int(2.3)` and passes it to the background process as is (msgpack is used to convert the values at this point). As a result, the Fixnum value `2` returned from the background process is printed to the terminal. Since data conversion relies solely on msgpack, the values that can be converted in both directions are those allowed by the [msgpack specification](https://github.com/msgpack/msgpack/blob/master/spec.md). If a method that is not defined on the background side is called, as in the `po.unknown_method(2.3)` example, an error is raised. Basically, that is all VirtualModule does.
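The forwarding behavior described above can be sketched in plain Ruby with `method_missing`. This is only an illustration under stated assumptions, not virtual_module's actual implementation: `FakeBackend` and `SketchProxy` are hypothetical names, and an in-process Hash of lambdas stands in for the real Python process and the msgpack transport.

```ruby
# Hypothetical stand-in for the background Python process:
# it knows a few "functions" by name and nothing else.
class FakeBackend
  FUNCTIONS = {
    "int" => ->(x) { x.to_i }, # mimics Python's int()
    "abs" => ->(x) { x.abs }
  }.freeze

  def call(name, args)
    fn = FUNCTIONS[name]
    raise "name '#{name}' is not defined" unless fn

    fn.call(*args)
  end
end

# Hypothetical proxy: any method it does not define itself is
# forwarded to the backend by name, together with its arguments.
class SketchProxy
  def initialize(backend)
    @backend = backend
  end

  def method_missing(name, *args)
    @backend.call(name.to_s, args)
  end

  def respond_to_missing?(_name, _include_private = false)
    true
  end
end

po = SketchProxy.new(FakeBackend.new)
p po.int(2.3) # => 2

begin
  po.unknown_method(2.3)
rescue RuntimeError => e
  puts e.message # prints: name 'unknown_method' is not defined
end
```

The proxy itself knows nothing about `int`; whether a call succeeds is decided entirely on the backend side, which is why an unknown name only surfaces as an error after the round trip.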
Some points are probably still unclear from this alone, so let me add a little more.
irb(main):004:0> po.datasets
=> #<Module:0x007ffd0906c030>
irb(main):005:0> po.datasets.load_iris(:_)
=> #<Module:0x007ffd09074500>
irb(main):006:0> po.datasets.load_iris(:_).vclass
=> "<class 'sklearn.datasets.base.Bunch'>"
irb(main):007:0> po.datasets.load_iris(:_).data[1].to_a
=> [4.9, 3.0, 1.4, 0.2]
This example shows what happens when a value that msgpack cannot convert is involved. Here, the proxy object (the local variable `po`) first returns a new proxy object (`#<Module:0x007ffd0906c030>`) for the call to `#datasets`, and then `#load_iris(:_)` returns yet another proxy object (`#<Module:0x007ffd09074500>`). In Python, `datasets` is a module object and the return value of `load_iris(:_)` is an instance of the `sklearn.datasets.base.Bunch` class. Neither can be converted via msgpack, so a Module instance is generated instead. For values that msgpack cannot convert like this, VirtualModule does not pass the actual value, only a pointer to that value.
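The "pass a pointer instead of the value" idea can be sketched as follows. The names `convertible?`, `Handle` and `HandleTable` are hypothetical, the type check only mirrors the basic msgpack types (nil, booleans, integers, floats, strings, arrays, maps) and ignores binary and extension types, and nothing here is taken from virtual_module's source.

```ruby
# True when a value consists only of types that msgpack can represent
# natively. Binary and extension types are left out of this sketch.
def convertible?(value)
  case value
  when nil, true, false, Integer, Float, String
    true
  when Array
    value.all? { |v| convertible?(v) }
  when Hash
    value.all? { |k, v| convertible?(k) && convertible?(v) }
  else
    false
  end
end

# A small token that can cross the process boundary in place of an object.
Handle = Struct.new(:id)

# Hypothetical registry kept on the side that owns the real objects.
class HandleTable
  def initialize
    @objects = {}
    @next_id = 0
  end

  # Return the value itself when it can be serialized,
  # otherwise store it and return a Handle that refers to it.
  def export(value)
    return value if convertible?(value)

    @next_id += 1
    @objects[@next_id] = value
    Handle.new(@next_id)
  end

  # Look up the real object behind a handle.
  def resolve(handle)
    @objects.fetch(handle.id)
  end
end

table = HandleTable.new
p table.export([4.9, 3.0, 1.4, 0.2]) # plain array: passed by value
ref = table.export(Object.new)       # arbitrary object: only a handle crosses
p ref.id                             # => 1
p table.resolve(ref).class           # => Object
```

In this picture, each Module instance returned to Ruby plays the role of a `Handle`: later calls on it are resolved against the object that stayed behind in the background process.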
irb(main):008:0> po.datasets.vclass
=> "<type 'module'>"
irb(main):009:0> iris = po.datasets.load_iris(:_)
=> #<Module:0x007ffd09057568>
irb(main):010:0> iris.target.vclass
=> "<type 'numpy.ndarray'>"
irb(main):011:0> iris.target.vmethods
=> ["T", "__abs__", "__add__", "__and__", "__array__", "__array_finalize__", "__array_interface__", "__array_prepare__", "__array_priority__", "__array_struct__", "__array_wrap__", "__class__", "__contains__", "__copy__", "__deepcopy__", "__delattr__", "__delitem__", "__delslice__", "__div__", "__divmod__", "__doc__", "__eq__", "__float__", "__floordiv__", "__format__", "__ge__", "__getattribute__", "__getitem__", "__getslice__", "__gt__", "__hash__", "__hex__", "__iadd__", "__iand__", "__idiv__", "__ifloordiv__", "__ilshift__", "__imod__", "__imul__", "__index__", "__init__", "__int__", "__invert__", "__ior__", "__ipow__", "__irshift__", "__isub__", "__iter__", "__itruediv__", "__ixor__", "__le__", "__len__", "__long__", "__lshift__", "__lt__", "__mod__", "__mul__", "__ne__", "__neg__", "__new__", "__nonzero__", "__oct__", "__or__", "__pos__", "__pow__", "__radd__", "__rand__", "__rdiv__", "__rdivmod__", "__reduce__", "__reduce_ex__", "__repr__", "__rfloordiv__", "__rlshift__", "__rmod__", "__rmul__", "__ror__", "__rpow__", "__rrshift__", "__rshift__", "__rsub__", "__rtruediv__", "__rxor__", "__setattr__", "__setitem__", "__setslice__", "__setstate__", "__sizeof__", "__str__", "__sub__", "__subclasshook__", "__truediv__", "__xor__", "all", "any", "argmax", "argmin", "argpartition", "argsort", "astype", "base", "byteswap", "choose", "clip", "compress", "conj", "conjugate", "copy", "ctypes", "cumprod", "cumsum", "data", "diagonal", "dot", "dtype", "dump", "dumps", "fill", "flags", "flat", "flatten", "getfield", "imag", "item", "itemset", "itemsize", "max", "mean", "min", "nbytes", "ndim", "newbyteorder", "nonzero", "partition", "prod", "ptp", "put", "ravel", "real", "repeat", "reshape", "resize", "round", "searchsorted", "setfield", "setflags", "shape", "size", "sort", "squeeze", "std", "strides", "sum", "swapaxes", "take", "tobytes", "tofile", "tolist", "tostring", "trace", "transpose", "var", "view"]
irb(main):012:0> iris.target.to_a
=> [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
In Ruby, you can inspect an object's state with `Object#class` and `Object#methods`. VirtualModule follows suit and provides similar methods (`#vclass` and `#vmethods`). As you might imagine, `#vclass` asks the background process for the type of the value and returns it, and `#vmethods` returns the methods available on that object.
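For comparison, this is the Ruby-side introspection that `#vclass` and `#vmethods` are modeled on. The snippet uses only core Ruby and does not touch virtual_module; the variable name `values` is just for illustration.

```ruby
values = [0, 1, 2]

# Object#class reports the Ruby class of the receiver.
p values.class                    # => Array

# Object#methods lists the method names the receiver responds to (as Symbols).
p values.methods.include?(:sort)  # => true
```

The difference is where the answer comes from: `#class` and `#methods` describe the Ruby-side object, whereas `#vclass` and `#vmethods` describe the value held by the background process.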
That's all for the explanation, but if you want to see more, there are some other examples on GitHub (under tree/master/example in the repository). This is an experimental implementation, so I expect it to be awkward to use in many places, but if anyone tries it, I'd be happy to hear what you think.