Posts

Bash prompt with colors, and git branch

So you're working on several machines at the same time, hopping from directory to directory, and using git with multiple branches. And, like me, you can't remember where you are or which git branch you're on, because you have half a dozen or more terminals open on your screen.

Here is a little trick you can do with bash: change the prompt.
First of all, you have to download this nice little script called git-prompt.sh and save it in your home directory as .git-prompt.sh.

Then, edit your .bashrc and add the line


source ~/.git-prompt.sh

wherever you want. It can be near the beginning, for example. If you don't have a .bashrc, look for .bash_profile instead.

Then search for the definition of the variable PS1 and replace its line with the following:


export PS1='\h:\[\e[33m\]\W\[\e[0m\]$(__git_ps1 " (%s)")\$ '
If you can't find it, just add the line wherever you want, but in any case after the source command from above.

The variable PS1 defines your …
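Once this is working, git-prompt.sh also understands a few optional environment variables that put more git information into the prompt. A minimal sketch (the GIT_PS1_* variables below are documented in the comments at the top of the script itself):

```shell
# In ~/.bashrc, after the "source ~/.git-prompt.sh" line:
export GIT_PS1_SHOWDIRTYSTATE=1       # show * for unstaged and + for staged changes
export GIT_PS1_SHOWSTASHSTATE=1       # show $ when something is stashed
export GIT_PS1_SHOWUNTRACKEDFILES=1   # show % when untracked files are present
export PS1='\h:\[\e[33m\]\W\[\e[0m\]$(__git_ps1 " (%s)")\$ '
```

With these set, a branch with pending changes shows up as something like (master *+) instead of just (master).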

Boost Spirit and C++ 11

Here's another little trick worth knowing. This time it's about Boost Spirit and C++11. Boost Spirit is a C++ library for generating parsers directly in C++ code, unlike other parser generators such as yacc, bison, or ANTLR, which have their own language. It's cool, heavily templated, and produces fast code. And fun to use!
If you try to compile the examples in the tutorial (like this one: http://www.boost.org/doc/libs/1_55_0/libs/spirit/example/qi/mini_xml3.cpp) with C++11, you will get a BIG error message: hundreds of undecipherable lines of template instantiation errors that are absolutely impossible to read unless you're a world expert, basically the author of the Spirit library.
The first time I tried, I ended up with 100+ KB of error messages! I tried different versions of my GCC compiler, and clang++ from the LLVM project, without any success.
After a lot of googling, I finally found a nice little post describing the simple solution to make it work. At the beg…

Bayesian Latent models: the shortest explanation ever

I found this on Wikipedia, and it is presumably the best and shortest explanation I have ever found of latent variable models, and especially of these Bayesian non-parametric models:

The Chinese Restaurant Process is often used to provide a prior distribution over assignments of objects to latent categories. The Indian buffet process is often used to provide a prior distribution over assignments of latent binary features to objects.

From the first definition, you clearly see the construction of your prior over the space of categories. You understand that you have an infinitely sized space where all possible combinations of categories exist, and you are building a distribution over this space which will then be used as a prior for the variables describing your objects.

I'll let you think about the second definition, but think of an infinite collection of labels you can put (or not) on each object, depending on whether it has the feature or not (and maybe my explanation is not as clear as the definition f…
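For concreteness, the Chinese restaurant process prior from the first definition can be written down in two lines. With $n$ customers already seated, $n_k$ of them at table $k$, and concentration parameter $\alpha$ (this is the standard notation, not taken from the quote above):

```latex
P(\text{customer } n+1 \text{ joins table } k) = \frac{n_k}{n + \alpha},
\qquad
P(\text{customer } n+1 \text{ opens a new table}) = \frac{\alpha}{n + \alpha}
```

Tables play the role of latent categories: popular categories attract new objects (the rich get richer), while $\alpha$ controls how often a brand-new category is created.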

Google Summer of Code 2013

I'm mentoring 2 students for the Google Summer of Code 2013, together with my colleague Nasos, who is a regular contributor to the project.
Boost has 7 students this year, and at Boost.uBLAS we're happy to say we have 2 students among the 7!

One will work on implementing the missing functions in BLAS levels 1, 2, and 3, and on introducing CPU-level parallelism such as auto-vectorization and hand-written SSE instructions.
The other will work on bringing parallelism to the core level, and especially to the network level, to make Boost.uBLAS one of the only general-purpose linear algebra libraries that can distribute computations over a network using MPI.

Creating a BTRFS filesystem on 2 disks

I know it has nothing to do with Machine Learning, AI, or even C++ coding, but think about it: it's also part of the job. You've just received a massive dataset and you only have small hard disks. In my case, I have 2 disks of 400 GB each. I know that's small by today's standards, and I don't want to bother with a lot of partitions and a complex tree structure, especially because I need to store my massive dataset that requires more than... 400 GB.

With modern Linux there are at least 2 solutions:
- LVM, the Logical Volume Manager
- BTRFS, a new filesystem that offers incredible features
I have been using LVM for years, so I decided to give BTRFS a try. Here is my simple setup:

- I have a desktop computer to which I've just added 2 hard disks of 400 GB each
- I want to group them as if they were a single 800 GB disk
- I want RAID0 for speed. RAID0 will split the data across the 2 hard disks at the same time, making my new volume twice as fast as a single hard disk (this is the theory; in practice, it i…
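For reference, the whole setup boils down to a couple of commands. This is a sketch: /dev/sdb and /dev/sdc are placeholder names for the two 400 GB disks, and mkfs.btrfs destroys whatever is on them.

```shell
# WARNING: this erases both disks. /dev/sdb and /dev/sdc are placeholders;
# replace them with your actual devices (check with: lsblk).
mkfs.btrfs -d raid0 -m raid1 /dev/sdb /dev/sdc

# Mounting either member device mounts the whole multi-device filesystem.
mount /dev/sdb /mnt/data

# Verify that both devices belong to the new filesystem.
btrfs filesystem show
```

Here the data is striped (-d raid0) for speed while the metadata is mirrored (-m raid1), a common compromise; pass -m raid0 as well if you really want everything striped.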

A graphical model of relationships between univariate distributions

Reading the excellent book Statistical Inference by George Casella and Roger L. Berger, I found a reference to an interesting 1986 work by Lawrence M. Leemis that's worth reading. In fact, this paper was updated in 2008 and presents a nice graphical model (though not really in the sense we all imagine now: just a graph) of univariate probability distributions and what it takes to go from one to another, such as generalizations or specializations. The graph in this paper is really amazing!
Just for fun, the Graph:

Linear Algebra in C++

What? Linear Algebra in C++, especially in Machine Learning, where almost everybody programs in Matlab? Yes, because most of the time C++ is the language used in industry. And not a bad one, to be honest (please, no language war here; I know many other languages are good too). So, I'm the new manager of a Linear Algebra library called uBLAS. Yes, you're reading that right: the one in the famous Boost libraries.
Boost is made of many libraries, focusing on template programming and higher-order programming. One of them, uBLAS, is devoted to Linear Algebra. I created a companion website. It's one of the fastest libraries; however, it lacks several things. We're working hard to improve that:
- documentation was poor: I made a new one and am still improving it
- no small, fast SSE vectors and no GPU support: we are already working on this
- I will change products from prod(m1,m2) to a nicer m1*m2 directly
- a new impressive assignment operator: it's easier and more powerful to fill in your matrices than…