I’ve spent the last couple of days improving the performance of my neural net simulator, which I wanted to embed in Ruby (for easier scripting). The pure C++ version is very fast, but as soon as I embed it in Ruby, performance drops significantly (by roughly 15% to 76% in the measurements below). Here are some numbers:
20.98 seconds, pure C++
24.16 seconds, C++ embedded in Ruby
26.16 seconds, C++ embedded in Ruby with wrapping C++ objects in Ruby objects
36.90 seconds, C++ embedded in Ruby with loading code written in Ruby (implies wrapping)
Note that in the last case, the loading code is written in Ruby. The time spent loading the net is not significant (under 1 second)!
So how come the same code runs slower just because it is embedded in Ruby? At first I thought it must have something to do with compiler flags like -fPIC or -shared. But no, that’s not the reason!
My latest theory is memory fragmentation. Loading a net in Ruby allocates a lot of objects, which makes further memory allocations slower. That’s the only reason I can think of right now. It would be nice to have two separate memory regions, one for Ruby and one for my C++ application, so that fragmenting one doesn’t hurt the other.
In my C++ application I have to allocate a lot of memory for the priority queues, mainly to expand them so they can hold more entries. That need to grow an array is the only downside of the priority queues I use. Maybe I’ll switch to a more advanced data structure, one I found under the name DSplay. A DSplay combines three structures: a splay tree that holds the newest items, a calendar queue (covering one year) for the medium-term items, and a linked list for all the “far future” items. The good thing is that this combination doesn’t need that much allocation. A free list allocator is enough. Of course I’d use an implicit heap instead of the splay tree, which would make it a lot faster IMO.
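To make the "a free list allocator is enough" remark a bit more concrete, here is a minimal sketch of such an allocator in C++. It is not the simulator's actual code: the `Event` struct, the `FreeListPool` name and the chunk size are all made up for illustration. The idea is only that nodes for the calendar queue and the far-future list could come from a pool that grabs memory in big chunks and recycles freed nodes, so the general-purpose heap (shared with Ruby) is touched rarely.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <new>
#include <utility>
#include <vector>

// Hypothetical event record; the real simulator's queue entries will differ.
struct Event {
    double time;
    int    neuron;
};

// Fixed-size node pool with a free list (illustrative sketch, C++14).
// Memory is requested in large chunks; released nodes are pushed onto a
// singly linked free list and handed out again by the next create().
template <typename T>
class FreeListPool {
    union Slot {
        Slot* next;                                   // used while the slot is free
        alignas(T) unsigned char storage[sizeof(T)];  // used while it holds a T
    };

    std::vector<std::unique_ptr<Slot[]>> chunks_;
    Slot*       free_ = nullptr;
    std::size_t chunk_size_;

    // Allocate one more chunk and thread all of its slots onto the free list.
    void grow() {
        chunks_.push_back(std::make_unique<Slot[]>(chunk_size_));
        Slot* chunk = chunks_.back().get();
        for (std::size_t i = 0; i < chunk_size_; ++i) {
            chunk[i].next = free_;
            free_ = &chunk[i];
        }
    }

public:
    explicit FreeListPool(std::size_t chunk_size = 1024) : chunk_size_(chunk_size) {
        assert(chunk_size > 0);
    }

    // Pop a slot from the free list and construct a T in it.
    template <typename... Args>
    T* create(Args&&... args) {
        if (!free_) grow();
        Slot* s = free_;
        free_ = s->next;
        return new (s->storage) T{std::forward<Args>(args)...};
    }

    // Destroy the T and push its slot back onto the free list.
    void destroy(T* p) {
        p->~T();
        Slot* s = reinterpret_cast<Slot*>(p);
        s->next = free_;
        free_ = s;
    }

    // Number of chunks requested from the system allocator so far.
    std::size_t chunks() const { return chunks_.size(); }
};
```

With something like `FreeListPool<Event> pool(4096);`, a call such as `pool.create(1.5, 7)` only reaches the system allocator once per 4096 nodes, and `pool.destroy(p)` makes the slot immediately reusable. Whether this actually removes the slowdown seen under Ruby is still just a guess that would have to be measured.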