No. Although ViennaCL provides all the facilities for the user to build her own multi-GPU implementations, the efficient use of multiple GPUs is highly algorithm-dependent and may even be impossible because of the overhead of PCI-Express communication.
Accelerators are based on throughput-oriented architectures and can only show their full potential for sufficiently large data sizes. As a rule of thumb, a compute kernel needs to work on at least 100 KB of data to hide PCI-Express latency. Even for integrated GPUs such as AMD's APU product line, the cost of launching a kernel is still on the same order of magnitude.
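The rule of thumb above can be turned into a simple size check before dispatching work to an accelerator. The following is only a minimal sketch: the names `offload_threshold_bytes` and `worth_offloading` are hypothetical and not part of ViennaCL, and the 100 KB threshold is merely the rough figure quoted above, so it should be tuned by measurement on the target machine.

```cpp
#include <cassert>
#include <cstddef>

// Rough threshold taken from the rule of thumb above (about 100 KB per kernel).
// Hypothetical constant for illustration; not a ViennaCL setting.
const std::size_t offload_threshold_bytes = 100 * 1024;

// Returns true if the data touched by a kernel is large enough that
// running it on an accelerator is likely to pay off.
inline bool worth_offloading(std::size_t num_entries,
                             std::size_t bytes_per_entry,
                             std::size_t threshold_bytes = offload_threshold_bytes)
{
  assert(bytes_per_entry > 0);
  return num_entries * bytes_per_entry >= threshold_bytes;
}
```

For example, `worth_offloading(1000, sizeof(double))` is false because a vector of 1000 doubles only occupies about 8 KB, so such a vector is better processed on the CPU.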
Many compute kernels (in particular: all BLAS Level 1 and 2 operations) are limited by the available memory bandwidth. This typically holds true for sparse matrices as well. GPUs which are integrated with the CPU on the same chip use the same memory link as the CPU, so the limiting resource is the same.
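To see why memory bandwidth is the limiting resource, consider a vector addition x = y + z: each entry requires two reads and one write, but only a single floating point operation. A minimal sketch of the resulting lower bound on execution time is given below. The function names are hypothetical, and any bandwidth figure passed in is an assumption to be replaced by the measured value for the actual device.

```cpp
#include <cassert>
#include <cstddef>

// Bytes moved by x = y + z for vectors of n doubles: two reads and one write per entry.
inline double vector_add_bytes(std::size_t n)
{
  return 3.0 * sizeof(double) * static_cast<double>(n);
}

// Lower bound on the execution time (in seconds) of a kernel
// that is limited only by memory bandwidth.
inline double bandwidth_bound_seconds(double bytes_moved,
                                      double bandwidth_bytes_per_second)
{
  assert(bandwidth_bytes_per_second > 0.0);
  return bytes_moved / bandwidth_bytes_per_second;
}
```

With an illustrative bandwidth of 24 GB/s, adding two vectors of one million doubles moves 24 MB and therefore takes at least about one millisecond, no matter how many arithmetic units the device has.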
Laptop GPUs are optimized for low power consumption rather than raw performance. You may use laptop GPUs to debug your implementations, but you need to run on high-end discrete GPUs for best performance.
Yes, you can, but with the exception of the Python wrapper PyViennaCL we do not provide complete wrappers for other languages yet. A shared library callable from C (and thus from any other language that is able to call C functions) is currently under development, but needs more time to mature.
There is no explicit single source of funding. We develop ViennaCL in our scientific spare time within more application-oriented projects, from which we extract the developed components and make them available in a library context. These projects have been funded by the Austrian Science Fund (FWF), the European Research Council, and the FASTMath project within the US Department of Energy. Generous support has also been received in the course of the Google Summer of Code since 2011.
This is due to the just-in-time compilation of OpenCL kernels. The NVIDIA graphics driver caches the compiled kernels, so the overhead is only seen at the first run on a particular machine. The OpenCL SDKs of AMD and Intel, however, recompile all kernels with each program launch. Other OpenCL SDKs most likely show similar behavior.
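When benchmarking, a common way to keep the just-in-time compilation out of the measurement is to run the operation once before starting the timer. The following is a minimal sketch of that idea, not ViennaCL's own benchmarking code: the helper name `average_seconds_after_warmup` is hypothetical, and the sketch assumes a C++11 compiler for `<chrono>`. For GPU work the callable is assumed to block until the device has finished; otherwise only the kernel launch is timed.

```cpp
#include <cassert>
#include <chrono>

// Runs `work` once without timing it (warm-up, absorbs one-time costs such as
// kernel compilation), then returns the average wall-clock time in seconds
// over `repetitions` timed runs.
template <typename Callable>
double average_seconds_after_warmup(Callable work, int repetitions)
{
  assert(repetitions > 0);

  work();  // warm-up run, excluded from the measurement

  const std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
  for (int i = 0; i < repetitions; ++i)
    work();
  const std::chrono::steady_clock::time_point stop = std::chrono::steady_clock::now();

  return std::chrono::duration<double>(stop - start).count() / repetitions;
}
```

Any function object can be passed as `work`, for example a lambda that performs a matrix-vector product and then waits for the device to finish.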
ViennaCL is available as an add-on package for the solver library PETSc, through which you can run the iterative solvers with full MPI-parallelization across nodes. Not all features of ViennaCL are available through PETSc, though.
Our aim is to make ViennaCL available on as many machines as possible. However, many enterprise-class machines do not ship with compilers supporting C++11. For example, the default compiler on CentOS 5.11 is GCC 4.1, which does not support any C++11 features at all.
There are multiple options available:
