nuno job
nuno
job

geek. open-source enthusiast. shaping the future of the node.js ☁ @jitsu. founder @thenodefirm & curator @lxjs

this blog is foss. it was created by hij1nx and tweaked by me. you can fork it here




conf.couchdb.com

Presented at the CouchDB conference. Here are the slides:




migrating a production couchdb database with joyent and stud

This article now lives in blog.yld.io/2014/03/26/migrating-a-production-couchdb-database-with-joyent-and-stud




ghcopy — using gist as your internet clipboard

a dead simple way to copy and paste your stdin into a github gist

cat ~/.ssh/id_rsa.pub 2>&1 | ghcopy

it's configurable:

Options:
  -d, --description  description for this gist                              [default: "gist created by github.com/dscape/ghcopy"]
  -v, --verbose      output to console while creating the gist              [boolean]  [default: true]
  -f, --filename     filename for the file pasted in this gist              [default: "ghcopy.txt"]
  -p, --public       boolean defining if this gist should be public or not  [boolean]  [default: false]
  -o, --open         boolean defining if we should open it in a browser     [boolean]  [default: true]
  -t, --token        define a github token                                  [required]  [default: "84c90072d47a61c0d0e51c11c42896e0bf7f8be6"]
  -h, --help         this message
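for example (hypothetical invocations, using the flags listed above and assuming a token is already configured):

# copy a file into a public gist with a custom filename and description
cat notes.txt | ghcopy -p -f notes.txt -d "meeting notes"

# pipe command output and keep the gist private (the default)
ls -la | ghcopy -f listing.txt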



lynx — a minimalistic node.js client for statsd server

> var lynx = require('lynx');
//
// Options in this instantiation include:
//   * `on_error` function to be executed when we have errors
//   * `socket` if you wish to just use an existing udp socket
//   * `scope` to define a prefix for all stats, e.g. with `scope`
//     'product1' and stat 'somestat' the key would actually be
//     'product1.somestat'
//
> var metrics = new lynx('localhost', 8125);
{ host: 'localhost', port: 8125 }
> metrics.increment('node_test.int');
> metrics.decrement('node_test.int');
> metrics.timing('node_test.some_service.task.time', 500); // time in ms
> metrics.gauge('gauge.one', 100);
> metrics.set('set.one', 10);
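the options mentioned in the comment above go in a third argument; a minimal sketch, assuming the constructor accepts an options object as that comment suggests:

var metrics = new lynx('localhost', 8125, {
    scope    : 'product1'                             // stats become 'product1.*'
  , on_error : function (err) { console.error(err); } // handle socket errors
  });
metrics.increment('somestat');                        // sent as 'product1.somestat'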



specify — simple nodejs testing

specify is the simplest way i could think of to do node.js testing

var specify = require('specify');

specify('create_by_secret', function (assert) {
  user.create_by_secret({invitation_code: "1234321!!"}, function (err) {
    assert.equal(err.eid, "ec:api:user:create_by_secret:wrong_code");
    assert.equal(err.status_code, 400);
  });
});

specify.run();
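other tests in this blog pass process.argv.slice(2) to specify.run, which (assuming specify filters tests by name, as those examples suggest) lets you run a subset of tests from the command line:

// run only the tests named on the command line, e.g. `node test.js create_by_secret`
specify.run(process.argv.slice(2));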



statsd-parser — a streaming statsd parser

a streaming parser for the statsd protocol

var stream = require("statsd-parser").createStream(options);

stream.on("error", function (e) {
  // unhandled errors will throw, since this is a proper node
  // event emitter.
  console.error("error!", e);
  // clear the error
  this._parser.error = null;
  this._parser.resume();
})

stream.on("stat", function (txt, obj) {
  // same object as above
});
//
// pipe is supported, and it's readable/writable
// same chunks coming in also go out
//
fs.createReadStream("file.statsd")
  .pipe(stream)
  .pipe(fs.createWriteStream("file-fixed.statsd"));



nodestack online conference

delivered an introductory talk on node.js. you can watch the recording:




lxjs 2012

can't believe this but i actually co-organized a conference:




joyent case study

interviewed by joyent about nodejitsu. check it here




p — pattern matching in javascript

p is a way to do pattern matching in javascript that helps you with asynchronous iterations. it's also crazy, so don't use it unless you are also insane!

// map
// map _ []     = []
// map f (x:xs) = f x : map f xs
map(_, [], ac, cb, 
  function map_done(f, l, ac, cb) { return cb(ac); });
map(f, l, ac, cb, 
  function map_catch_all(f, l, ac, cb) {
    ac.push(f(l.shift())); // head
    map(f, l, ac, cb); // l is now tail
  });



how to make your first node.js pull request

if you use open source and github you are probably used to creating issues and having them magically solved for you. you just npm install a new version and the issue is gone. great right?

well, module maintainers can solve the majority of issues for developers (end users), but this takes time. we need some kind of policy to determine how we react to a new issue on github. this is what i currently do:

the first thing you should do is talk to the module owner (irc, email) and see if this really is an issue.

if it is, create an issue about what you found. you should include:

now wait for the module owner to respond to you. she will either:

if you are asked for a pull request, you need to fork the repo into your own user.

imagining your repo is dscape/foobar and the bug is issue 63, this is what you could do:

git clone git@github.com:dscape/foobar.git
cd foobar
git branch issue63
git checkout issue63

now you need to make the changes that fix your bug. this will likely involve some investigation. after you do so, run:

git diff

the output should be small, limited to the minimum number of lines of code necessary to fix the issue. if you have fixed other stuff that is unrelated, please undo it, create a new issue, and do that work in a new branch.

if it all looks ok to you, do a git status to check that you didn't introduce any files by mistake, and finally git add . (or simply add the files you changed one by one). ok, time to commit:

git commit -m "[fix minor] fixes #63

* Solves foo
* Modifies Behavior in bar
"

if you are curious about why i wrote fixes #63 in the commit message check the blog post about issues 2.0 on github.

now you need to add tests. read the project's readme and see if there are instructions on adding and running tests.

generally speaking, you should start off by running the existing test suite and make sure you didn't break anything:

npm install
npm test

you should also inspect the package.json file and make sure there's nothing else you should be running, like un-mocked tests for instance.

now add your tests, and repeat the git diff, git status, git commit workflow.

if the tests all pass you are ready test-wise. add the docs, same workflow again.

now:

git push origin issue63

now go to github and open a pull request while selecting the appropriate branch.

the extra mile

if you want to develop your own modules you might be interested in what happens next, from the module maintainer's perspective.

when you see a pull request you navigate to it and review the code, tests, and documentation. you might go back and forth asking why some code is done in a certain way (i.e. code review).

if all is right you are going to do something like (in your module directory):

git remote add dscape git@github.com:dscape/foobar.git
git pull
git pull dscape issue63

now you need to run tests. when maintaining nano i normally do:

npm test
npm run nock_off

if there are bugs (which happens about 80% of the time because people forget to run tests) the module maintainer will fix them. she will add fixtures, add mocks (whatever it takes) and commit them. after all is working, docs are fixed, etc., the module maintainer will add you to the contributors list and go on to publish a new version. let's say this is version 0.0.2:

git push

now we wait for the travis build and make sure the tests pass in multiple node versions. if they don't, the module maintainer goes on fixing bugs again.

when the tests finally pass:

git tag 0.0.2
git push --tags
npm publish

now the module maintainer will go back to the issue, close it, and let the user know the fix is available in version 0.0.2.




how to update a document with nano

couchdb uses mvcc, so updates are not done in place. when you insert a document into couchdb a pointer will say "for this uri, this is the current version of the document".

so in a sense there's no updating in an mvcc database: an update means changing a pointer.

in nano i deliberately tried not to have parts of the api called "update" or "connect", since those are not things that you do in couchdb. in couchdb, you insert:

// insert {foo: "baz"} into the "foobaz" document
db.insert({"foo": "baz"}, "foobaz", function (error, foo) {   
  if(!error) {
    console.log("it worked");
  } else {
    console.log("sad panda");
  }
});

if you need to update a document then you should just insert again (but specifying the revision you are updating):

db.insert({"foo": "bar"}, "foobar", function (error, foo) {
  if(error) {
    return console.log("I failed");
  }
  db.insert({foo: "bar", "_rev": foo.rev}, "foobar", 
  function (error, response) {
    if(!error) {
      console.log("it worked");
    } else {
      console.log("sad panda");
    }
  });
});

you need to specify the revision so that couchdb can make sure for you that no one did conflicting updates while you were editing the document. if the rev you send to couchdb is not the latest rev you will get a conflict.
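if you don't have the latest rev at hand, a common pattern is to fetch the document first and re-insert it. a minimal sketch using nano's db.get (error handling kept short):

db.get("foobar", function (error, doc) {
  if(error) {
    return console.log("could not fetch foobar");
  }
  doc.foo = "something new"; // doc already carries the latest _rev
  db.insert(doc, "foobar", function (error, response) {
    console.log(error ? "sad panda" : "it worked");
  });
});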

you can also use design documents to perform updates in couchdb. read more on how you can do that with nano at jackrussell.




futoncli — command line interface for couchdb

futoncli is a command line tool for managing and interacting with couchdb. it's open-source and easy to use

check the repo for more info

or just check out the video:

play futon command line client video in codestream




spell — norvig spell checker in javascript

spell is a dictionary module for node.js. for an explanation of the algorithm, performance, expectations, and techniques used please read this article

var spell = require('spell')
  , dict  = spell()
  ;
dict.load("I am going to the park with Theo today. " +
  "It's going to be the bomb");
console.log(dict.suggest('thew'));



mock http integration testing in node.js using nock and specify

one of the big things in releasing nano 3 was updating the tests. i really wanted this release out before lxjs but i had a lot of requirements for the tests:

i decided to use nock and specify for this. nock because it's the only tool that can do this job, and specify because, despite being written in 100 loc, it is the only tool that fills all of these requirements. (disclaimer: i'm only familiar with node-tap, mocha, & vows, so other testing tools might be able to meet these requirements too.)

in this article we are going to npm install nano and write some tests for it. this is a bit strange, but it serves demonstration purposes. you can check the nano tree for more tests if you feel like exploring further after you are done here.

mkdir nano_mock_testing
cd nano_mock_testing
npm install nano specify nock

these are the versions that got installed on my machine:

nock@0.13.0 ./node_modules/nock 
specify@0.4.0 ./node_modules/specify 
├── cycle@1.0.0
├── colors@0.6.0-1
└── difflet@0.2.1
nano@3.0.1 ./node_modules/nano 
├── errs@0.2.0
├── request@2.9.202
└── follow@0.8.0

and my node versions

$ node -e "console.log(process.versions)"
{ node: '0.6.7',
  v8: '3.6.6.15',
  ares: '1.7.5-DEV',
  uv: '0.6',
  openssl: '0.9.8r' }

now we can test the insert functionality of nano. let's write two tests, one testing a simple insert and another inserting functions inside documents. the test has a setup stage where we create a database and a teardown stage where we destroy the database we created. because specify can handle uncaught exceptions we know that teardown will always execute, so this won't affect your future tests even if you run in un-mocked mode.

var specify  = require('specify')
  , helpers  = require('./helpers')
  , timeout  = helpers.timeout
  , nano     = helpers.nano
  , nock     = helpers.nock
  ;

// this will work with nocks when you set NOCK=on
// and without nocks when you don't set the environment variable
// NOCK
var mock = nock(helpers.couch, "doc/insert")
  , db   = nano.use("doc_insert")
  ;

specify("setup", timeout, function (assert) {
  nano.db.create("doc_insert", function (err) {
    assert.equal(err, undefined, "Failed to create database");
  });
});

specify("simple", timeout, function (assert) {
  db.insert({"foo": "baz"}, "foobaz", function (error, foo) {   
    assert.equal(error, undefined, "Should have stored foo");
    assert.equal(foo.ok, true, "Response should be ok");
    assert.ok(foo.rev, "Response should have rev");
  });
});

specify("functions", timeout, function (assert) {
  db.insert({fn: function () { return true; },
  fn2: "function () { return true; }"}, function (error, fns) {   
    assert.equal(error, undefined, "Should have stored foo");
    assert.equal(fns.ok, true, "Response should be ok");
    assert.ok(fns.rev, "Response should have rev");
    db.get(fns.id, function (error, fns) {
      assert.equal(fns.fn, fns.fn2, "fn matches fn2");
      assert.equal(error, undefined, "Should get foo");
    });
  });
});

specify("teardown", timeout, function (assert) {
  nano.db.destroy("doc_insert", function (err) {
    assert.equal(err, undefined, "Failed to destroy database");
    assert.ok(mock.isDone(), "Some mocks didn't run");
  });
});

specify.run(process.argv.slice(2));

this won't run yet: we reference a helpers.js that doesn't exist. let's create it. it needs to provide the functionality to run tests in mocked and un-mocked mode, and expose things like the couchdb configuration and default timeouts.

var path    = require('path')
  , fs      = require('fs')
  , cfg     = {couch: "http://localhost:5984", timeout: 50000}
  , nano    = require('nano')
  , helpers = exports
  ;

function endsWith (string, ending) {
  return string.length >= ending.length && 
    string.substr(string.length - ending.length) == ending;
}

function noop(){}

function fake_chain() {
  return {
      "get"                  : fake_chain
    , "post"                 : fake_chain
    , "delete"               : fake_chain
    , "put"                  : fake_chain
    , "intercept"            : fake_chain
    , "done"                 : fake_chain
    , "isDone"               : function () { return true; }
    , "filteringPath"        : fake_chain
    , "filteringRequestBody" : fake_chain
    , "matchHeader"          : fake_chain
    , "defaultReplyHeaders"  : fake_chain
    , "log"                  : fake_chain
  };
}

helpers.timeout = cfg.timeout;
helpers.nano = nano(cfg.couch);
helpers.Nano = nano;
helpers.couch = cfg.couch;
helpers.pixel = "Qk06AAAAAAAAADYAAAAoAAAAAQAAAP////8BABgAAAAA" + 
                "AAAAAAATCwAAEwsAAAAAAAAAAAAAWm2CAA==";

helpers.loadFixture = function helpersLoadFixture(filename, json) {
  var contents = fs.readFileSync(
    path.join(__dirname, 'fixtures', filename), 'ascii');
  return json ? JSON.parse(contents): contents;
};

helpers.nock = function helpersNock(url, fixture) {
  if(process.env.NOCK) {
    var nock    = require('nock')
      , nocks   = helpers.loadFixture(fixture + '.json', true)
      ;
    nocks.forEach(function(n) {
      var route    = n.path   // intercepted route (kept separate from the path module)
        , method   = n.method   || "get"
        , status   = n.status   || 200
        , response = n.buffer
                   ? new Buffer(n.buffer, 'base64') 
                   : n.response || ""
        , headers  = n.headers  || {}
        , body     = n.base64
                   ? new Buffer(n.base64, 'base64').toString()
                   : n.body
        ;

      if(typeof response === "string" && endsWith(response, '.json')) {
        response = helpers.loadFixture(path.join(fixture, response));
      }
      if(typeof headers === "string" && endsWith(headers, '.json')) {
        headers = helpers.loadFixture(path.join(fixture, headers));
      }
      if(body==="*") {
        nock(url).filteringRequestBody(function(request_body) {
          return "*";
        })[method](route, "*").reply(status, response, headers);
      } else {
        nock(url)[method](route, body).reply(status, response, headers);
      }
    });
    nock(url).log(console.log);
    return nock(url);
  } else {
    return fake_chain();
  }
};

a few things to note here: response and body refer to an http response and the body of an http request (e.g. a post). base64 means you have a base64 http request body in the fixture, and buffer means you have a base64 http response that should be converted into a buffer.

this is not the perfect solution. it's just a solution that fits testing nano, and it's also the reason why i believe this boilerplate code is a good thing vs. having it in a library: having this in nock would mean making these decisions for you, and i personally think you should make as few decisions for your users as possible.

another thing to notice is that you can use * to accept any http request body, and that if a header or response ends with .json the helper will try to load a fixture with that name. we aren't going to use this, but it can be handy.

we should now be able to run this un-mocked (we haven't created the mocks yet, remember):

$ node insert.js 

  /insert.js

✔ 1/1 setup 
✔ 3/3 simple 
✔ 5/5 functions 
✔ 2/2 teardown 
✔ 11/11 summary

And the database got destroyed (as it should) by the teardown phase:

$ curl localhost:5984/doc_insert
{"error":"not_found","reason":"no_db_file"}

to run the mocked tests we need to add the fixture at fixtures/doc/insert.json. this isn't required by nock, it's something we defined in our helpers.js file:

[
  { "method"   : "put"
  , "path"     : "/doc_insert"
  , "status"   : 201
  , "response" : "{ \"ok\": true }" 
  }
, { "method"   : "put"
  , "status"   : 201
  , "path"     : "/doc_insert/foobaz"
  , "body"     : "{\"foo\":\"baz\"}"
  , "response" : "{\"ok\":true,\"id\":\"foobaz\",\"rev\":\"1-611488\"}"
  }
, { "method"   : "post"
  , "status"   : 201
  , "path"     : "/doc_insert"
  , "body"     : "{\"fn\":\"function () { return true; }\",\"fn2\":\"function () { return true; }\"}"
  , "response" : "{\"ok\":true,\"id\":\"123\",\"rev\":\"1-611488\"}"
  }
, { "path"     : "/doc_insert/123"
  , "response" : "{\"fn\":\"function () { return true; }\",\"fn2\":\"function () { return true; }\",\"id\":\"123\",\"rev\":\"1-611488\"}"
  }
, { "method"   : "delete"
  , "path"     : "/doc_insert"
  , "response" : "{ \"ok\": true }" 
  }
]

Exit CouchDB and try it out:

$ NOCK=on node insert.js 

  /insert.js

✔ 1/1 setup 
✔ 3/3 simple 
✔ 5/5 functions 
✔ 2/2 teardown
✔ 11/11 summary

if you try running the tests un-mocked it will fail, since you switched couchdb off!

testing binary streams

wow you made it this far! let's just add a simple streaming test where we insert a pixel in couchdb. let's call it pipe.js:

var fs       = require('fs')
  , path     = require('path') 
  , specify  = require('specify')
  , helpers  = require('./helpers')
  , timeout  = helpers.timeout
  , nano     = helpers.nano
  , nock     = helpers.nock
  , pixel    = helpers.pixel
  ;

var mock = nock(helpers.couch, "att/pipe")
  , db   = nano.use("att_pipe")
  ;

specify("setup", timeout, function (assert) {
  nano.db.create("att_pipe", function (err) {
    assert.equal(err, undefined, "Failed to create database");
  });
});

specify("test", timeout, function (assert) {
  var buffer   = new Buffer(pixel, 'base64')
    , filename = path.join(__dirname, '.temp.bmp')
    , ws       = fs.createWriteStream(filename)
    ;
    ws.on('close', function () {
      assert.equal(fs.readFileSync(filename).toString('base64'), pixel);
      fs.unlinkSync(filename);
    });
    db.attachment.insert("new", "att", buffer, "image/bmp", 
    function (error, bmp) {
      assert.equal(error, undefined, "Should store the pixel");
      db.attachment.get("new", "att", {rev: bmp.rev}).pipe(ws);
    });

});

specify("teardown", timeout, function (assert) {
  nano.db.destroy("att_pipe", function (err) {
    assert.equal(err, undefined, "Failed to destroy database");
    assert.ok(mock.isDone(), "Some mocks didn't run");
  });
});

specify.run(process.argv.slice(2));

And create the fixture in fixtures/att/pipe.json:

[
  { "method"   : "put"
  , "path"     : "/att_pipe"
  , "status"   : 201
  , "response" : "{ \"ok\": true }" 
  }
, { "method"   : "put"
  , "path"     : "/att_pipe/new/att"
  , "base64"   : "Qk06AAAAAAAAADYAAAAoAAAAAQAAAP////8BABgAAAAAAAAAAAATCwAAEwsAAAAAAAAAAAAAWm2CAA=="
  , "status"   : 201
  , "response" : "{\"ok\":true,\"id\":\"new\",\"rev\":\"1-3b1f88\"}\n"
  }
, { "path"     : "/att_pipe/new/att?rev=1-3b1f88"
  , "status"   : 200
  , "buffer"   : "Qk06AAAAAAAAADYAAAAoAAAAAQAAAP////8BABgAAAAAAAAAAAATCwAAEwsAAAAAAAAAAAAAWm2CAA=="
  }
, { "method"   : "delete"
  , "path"     : "/att_pipe"
  , "status"   : 200
  , "response" : "{ \"ok\": true }" 
  }
]

updated nock

nock now supports a NOCK_OFF setting. check the documentation for details. this means that the fake_chain and the if in your helpers.js file should no longer be necessary.
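a minimal sketch of what the simplified helper might look like, assuming NOCK_OFF simply turns the interceptors into pass-throughs when it is set:

helpers.nock = function helpersNock(url, fixture) {
  var nock  = require('nock')
    , nocks = helpers.loadFixture(fixture + '.json', true)
    ;
  nocks.forEach(function (n) {
    // same per-fixture interceptor setup as in the helpers.js above
  });
  return nock(url);
};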




nano 3

nano is a dead simple, minimalistic couchdb client for node.js.

we just released version three, and this post outlines some important changes.

pool size & cookies

one of the things users were most worried about was the lack of support for authentication and for setting the pool size. as for authentication, it normally ends up being unnecessary and based on misconceptions by users about how couchdb works.

as for the pool size, well that's another story.

in node.js your pool size determines the maximum number of parallel active connections you can run, while others get queued. the default is 5 (reference needed) and for some applications changing this is an important feature of a couchdb client.

nano now supports a new object literal in the configuration called request_defaults that will help you do this. you should follow the request documentation for details, as these options are beyond the scope of nano.

var db = require('nano')(
  { "url"             : "http://localhost:5984/foo"
  , "request_options" : { "proxy" : "http://someproxy" }
  , "log"             : function (id, args) { 
      console.log(id, args);
    }
  });

follow

if you love follow and you are tired of requiring both nano and follow this is your release. you can now use db.follow just like you use follow:

var feed = db.follow({since: "now"});
feed.on('change', function (change) {
  console.log("change: ", change);
});
feed.follow();
process.nextTick(function () {
  db.insert({"bar": "baz"}, "bar");
});

callback style is also supported.
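a minimal sketch of the callback form, assuming it mirrors the follow module's own (error, change) callback:

db.follow({since: "now"}, function (error, change) {
  if(!error) {
    console.log("change: ", change);
  }
});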

atomic

we have had updatewithhandler since nano@2.x.x; however, this method was renamed to atomic. another api change is that the document is now sent in the request body instead of the query string. this fixes the limitation many of our production users hit when performing atomic updates with very large documents:

db.atomic("update", "inplace", "foobar", 
{field: "foo", value: "bar"}, function (error, response) {
  assert.equal(error, undefined, "Failed to update");
  assert.equal(response.foo, "bar", "Update worked");
});

a fully functioning example is in the test suite. as with the other methods, the tests are an excellent source of working samples.

streaming bug fixes

there was a problem when streaming non-attachments from couchdb in nano@2.1.0. this was fixed in both nano@2.1.1 and nano@3.0.0.

new tests

testing http apis is hard. i think i finally cracked a way that is both easy to read (and where the tests also work as live examples) and doesn't have a lot of clutter related to http mocking:

var specify  = require('specify')
  , helpers  = require('../helpers')
  , timeout  = helpers.timeout
  , nano     = helpers.nano
  , nock     = helpers.nock
  ;

var mock = nock(helpers.couch, "att/insert")
  , db = nano.use("att_insert")
  ;

specify("att_insert:setup", timeout, function (assert) {
  nano.db.create("att_insert", function (err) {
    assert.equal(err, undefined, "Failed to create database");
  });
});

specify("att_insert:test", timeout, function (assert) {
  db.attachment.insert("new", "att", "Hello World!", "text/plain",
    function (error, att) {
      assert.equal(error, undefined, "Should store the attachment");
      assert.equal(att.ok, true, "Response should be ok");
      assert.ok(att.rev, "Should have a revision number");
  });
});

specify("att_insert:teardown", timeout, function (assert) {
  nano.db.destroy("att_insert", function (err) {
    assert.equal(err, undefined, "Failed to destroy database");
    assert.ok(mock.isDone(), "Some mocks didn't run");
  });
});

specify.run(process.argv.slice(2));

mocks now live in fixtures and they look like this:

[
  { "method"   : "put"
  , "path"     : "/att_insert"
  , "status"   : 201
  , "response" : "{ \"ok\": true }" 
  }
, { "method"   : "put"
  , "path"     : "/att_insert/new/att"
  , "body"     : "\"Hello World!\""
  , "status"   : 201
  , "response" : "{\"ok\": true, \"id\": \"new\", \"rev\": \"1-921bd51\" }"
  }
, { "method"   : "delete"
  , "path"     : "/att_insert"
  , "status"   : 200
  , "response" : "{ \"ok\": true }" 
  }
]



my contribution to jsconf 2012: the good and the bad open-source

s/art|graffiti/open-source/g

this talk is about the good open-source and bad open-source.

this talk is about how marketeers and advertisers are targeting us developers, disrupting our thoughts by sugar coating their products with things we love.

open-source.

javascript.

this talk has a lot of quotes from well known graffiti artist banksy.

i want to show you how much the worries and thoughts of this artist can resemble our own.

if they don’t, i think they should.

there are bad actors in our scene who sell us silly ideas like "startup life" and "web-scale", and who normally use fud as their main argument.

"acid doesn’t scale!"

they own (or at least want to own) the communication channels we use everyday.

they are cashing in on us, they are selling our info, they are selling the ownership of our free thinking and passions.

for me this talk started some weeks, maybe months ago, when jan shared this quote from banksy's book on twitter.

it says that you have no choice whether or not you see an advertisement in a public space, hence it belongs to you.

he concludes that asking for permission to change it is like asking to keep a rock someone just threw at your head.

what grabbed my attention, though, was something different: how much marketers flood us daily with notions they want to sell as “truth”:

soul mates.

glamorous air travel.

cool clothes.

heck, they even invented santa claus.

today, more than ever, this applies to us developers. and the favorite word of these bad players is open-source.

open-source. community, curiosity & serendipity

if you think of open-source as being free beer, there’s this guy that brewed the perfect beer and open-sourced the formula so people could enjoy what he created. his sole purpose was to improve your experience when drinking a beer, and he gives you his formula in the hope that you will further improve it.

in the other kind of open-source they give you the free beer just because they want to get you drunk. then you wake up the next day and you are missing a kidney.

it doesn’t take much to understand that for me, open-source is about the guy that is sharing his formula:

a project that is community led,

where curiosity drives innovation

and serendipity is the natural instrument for learning and the growth of the community.

a project that doesn’t take pull requests is not open-source. it’s a marketing vehicle to get you, the product.

or like banksy would put it.

the people who run our cities don’t understand graffiti because they think nothing has the right to exist unless it makes a profit...


story 1 - couchdb

let me tell you a story. i was working for a great closed-source company that makes a document database that is like 10 years in the future. you can think of it as a database that shares the best traits of couchdb and the best benefits of elasticsearch.

i had been wanting to share how it worked with the open-source community, mostly to incentivize innovation and see if we could get something like it as an open-source product.

when i was invited to give a talk about it at berlin buzzwords i was super happy to share the internals of the product with the community.

the talk was great and in the end i was hanging out with some pals from couchdb, dale and volker. i love couchdb, it’s a great open-source product and i use it every day. then a guy approaches us and introduces himself with a piece of paper with some hexadecimal stuff on it.

“it’s my public key” he says.

after talking a bit, this person starts giving me grief for working on a closed-source solution. i asked him what database solution he used. he happened to use a gpl-licensed open-source database where the company that owns the trademark has all the commits. it’s that kind of read-only open-source.

not going into specifics, this shows the power of marketing. that product is only open-source because this company wanted a lot of adoption. a product that doesn't take pull requests.

while i was there to try to share the advances produced in a lesser-known, closed-source database, i got grief from this kid who didn’t even know what he was using.

the art we look at is made by only a select few. a small group create, promote, purchase, exhibit and decide the success of art

we are to blame for this.

as developers we perpetuate the cruel joke that marketers pull on us.

we recommend open-source projects blindly based on their authors, and we promote our own shit even when it's broken or not the right tool for the job.

yes, let people know what your open-source projects solve. but also explain what they are not designed to do.

don’t be a bullshitter.

stop idolizing and wanting to work at company x, y, or z. focus instead on the stuff you are doing and on your community.

you don’t need an invitation to do open-source, you are already invited.

story 2 - stay humble. promote change not brands

which brings me to story #2.

at mikeal’s party a bunch of the greatest people i know were there. i remember i was hacking with paolo and learning some serious node.js when ryan, the guy that invented node, walks by on skype with his friends.

he shows us to his friends, and one of his friends asks if we are “smart”.

his answer was amazing, he said

“it’s not like we work at google or facebook, we are just programmers”

i don’t think anyone in that room would have a problem getting a job at a large tech company, but i was humbled by this answer. we are just programmers.

it taught me:

stay humble, promote change not brands.

open source is community

a bunch of us create consulting companies, some run conferences, some build awesome open-source products, and so on...

these are the nice people.

and, most importantly, you are not the product.

join us.

organize meetups, create alternative businesses with fair business models.

buy their products, donate to their screencasts.

stop expecting everything on the internet to be free.

but most of all, stop going to the bad players because of fud or insurance.

a lot of people never use their initiative, because no-one told them to

banksy said people don’t use initiative because no one told them to

so i came to jsconf 2012 to tell you to.

speak softly, but carry a big can of paint

there’s no need for negativity, worse yet trolling. but carry a big can of paint.

exercise critical thinking, always.

thank you.




how do you stop your best people from leaving your team?

i have the privilege of working with a lot of great individuals. a lot has been said about managing this new wave of talented, creative, and extremely ambitious people.

it's not easy. we are a generation of people who love what we do and are extremely good at it. i don't think a generation ever worked as hard as we do, and we think about things that employees historically were not used to thinking about: business, marketing, company vision & values, branding, and the messaging of our products. we think about how that impacts our lives and how we can leverage the rules that exist in our world to create better products that have a social impact.

that's why we care about concepts like open-source, and we try crazy hard to monetize these concepts in ways that are both fair and allow for sustainable progress. it's not always an easy battle, but that was the topic of my jsconf talk this year, so you should check out the transcript when we publish it on the nodejitsu blog.

here i just wanted to share a question, and the conclusion i reached. i've seen people ask this question a million times, especially in silicon valley, and i never agreed with any of the answers, mostly from clueless y-combinator-type founders.

q: how do you stop your best people from leaving your team?

i think the answer is "ask them what they want to do next".

if that doesn't include your team, try your hardest to offer a challenge to this person that is exactly what she/he is looking for. if she/he doesn't know what she/he is looking for, help her/him out with that quest. now keep asking.

if the answer included you and your team, sweet! now you need to make sure you make them successful and help them achieve their goals. and keep asking!

what is your answer?




the cube interview — the node firm

interviewed by the cube about the node firm. check the video here

you can check my profile at the node firm here:




clarinet — sax based evented streaming json parser in javascript

i'm super happy to announce clarinet. it's currently running ~110 tests both in the browser and in node.js, which include some of the most obtuse edge cases i could find in other json parsers' tests, and it can currently parse the entire npm registry without blinking.

clarinet is not a replacement for json.parse. if you can json.parse, you should. it's super fast, comes bundled in v8, and that's all. move along.

my motivation for clarinet was to stream large (or small) chunks of json data and be able to create indexes on the fly.

clarinet is more like sax. it's a streaming parser and when you feed it json chunks it will emit events.

or in code:

var chunks = ['{"foo":', ' "bar', '"}'];

you can't json.parse that. even if you control the source that is emitting those chunks, there are plenty of situations where you can't just emit a 10mb file in one chunk. also, if your json file is larger than the memory you have available on your computer, you need clarinet.
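clarinet, on the other hand, can handle those chunks. a minimal sketch, using the same createStream api as the npmtop example below:

var clarinet     = require('clarinet')
  , parse_stream = clarinet.createStream()
  ;

// the first key of an object arrives with openobject, subsequent ones with key
parse_stream.on('openobject', function (key)   { console.log('key:', key); });
parse_stream.on('key',        function (key)   { console.log('key:', key); });
parse_stream.on('value',      function (value) { console.log('value:', value); });

// write the chunks one by one; clarinet handles values split across chunks
['{"foo":', ' "bar', '"}'].forEach(function (chunk) {
  parse_stream.write(chunk);
});
parse_stream.end();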

this is how you would implement substack's npmtop, a tool that returns a list of npm module authors ordered by the number of modules they publish, with clarinet:

var fs             = require('fs')
  , clarinet       = require('clarinet')
  , parse_stream   = clarinet.createStream()
  , author         = false // was the previous key an author?
  , authors        = {}    // authors found so far
  ;

// open object is emitted when we find '{'
// the name is the first key of the json object
// subsequent ones will emit key
parse_stream.on('openobject', function(name) {
  if(name==='author') author=true;
});

// a key was found
parse_stream.on('key', function(name) {
  if(name==='author') author=true;
});

// we got all of npm, lets aggregates results 
// and sort them by repo count.
parse_stream.on('end', function () {
  var sorted = []
    , i
    ;
  for (var a in authors)
    sorted.push([a, authors[a]]);
  sorted.sort(function(a, b) { return a[1] - b[1]; });
  i = sorted.length-1;
  while(i!==-1) {
    console.log(sorted.length-i, sorted[i]);
    i--;
  }
});

// value is emitted when we find a json value, just like in the
// specification in json.org: strings, true, false, null, and number.
//
// you can find out the value type by running a typeof
//
// this could be faster if we emitted different events for each value.
// e.g. .on('string'), .on('true'), etc..
//
// would be faster cause clarinet wouldn't have to parse it for you
// but this api choice seemed easier for the developer 
// that needs to have less events
// to attend to
parse_stream.on('value', function(value) {
  if(author) { 
    // get the current count for this author
    var current_count = authors[value];
    // if it exists increment it
    if (current_count) authors[value] +=1;
    // else it's the first one
    else authors[value] = 1;
    // this is not an author key
    author=false; 
  }
});

// create a read stream and pipe it to clarinet
fs.createReadStream(__dirname + '/npm.json').pipe(parse_stream);

feel free to browse the docs and samples for more goodies. feedback is great, pull requests are even better.

performance

tl;dr

since having a streaming parser requires a consistent understanding of the performance implications, i've done a preliminary study on how well clarinet performs. the source code is open, so you are welcome to replicate it.

because none of the other parsers was able to do streaming json parsing, i had to create a test that uses fs.readfilesync so all of the parsers could be tested. this sucks; i really wanted to test async parsers but none existed. i tried yajl (a c++ module that looks a lot like clarinet) but its current version does not build in node 0.6. jsonparse should also be able to run asynchronously, but since it's not documented properly and was made for a previous version of node i was unable to make it work. if you are looking for differences between jsonparse and clarinet:

if you want your parser to be included, or want to refute any of my claims, please send me an email and i'll fix this article, provided you give me source code and results to go along with it.

i created an async version of the tests but only clarinet is included there for obvious reasons. in the process i also created a profiling page that can help you get profiling information about clarinet using google chrome developer tools.

in detail

in the test we compared clarinet, json.parse (referred to as v8 in the tables and figures) and @creationix's jsonparse. to avoid sample bias i tested all modules against four different json documents:

in order to test whether clarinet, json.parse, and the jsonparse modules differed in terms of execution time, i conducted analyses of variance (anovas). to obtain the estimate data, i ran scripts that created 10 runs for each json document, resulting in 40 measurements per parser.

then, the three modules were at first compared between each other, regardless of the documents that generated their values (i.e. the execution times). this anova showed that the differences in the execution times obtained with clarinet, json.parse, and jsonparse, were statistically significant (f(2,117) = 40.28, p = .000).

post-hoc tests with scheffe correction revealed that the execution times of json.parse module were statistically different from both clarinet and jsonparse, but these two did not differ from one another (see table 1 and fig. 1). specifically, json.parse module demonstrated smaller execution times than both clarinet and jsonparse.

table 1 figure 1

next, given that two of the documents were "big" (i.e. > 1mb), and two of them were "small" (i.e. < 1mb) and the expectation that the size of the documents would play an important role in the performance of the modules, i computed an anova to compare the execution times of the modules for the "big" documents, and another anova to compare them for the "small" documents.

the differences between the execution times of clarinet, json.parse, and jsonparse were statistically significant for the "big" documents (f(2,57) = 279.96, p = .000).

post-hoc tests with scheffe correction revealed that the execution times of the modules were statistically different between the three of them (see table 2 and fig. 2).

table 2 figure 2

the differences between the execution times of clarinet, json.parse, and jsonparse were also statistically significant for the "small" documents (f(2,57) = 36.95, p = .000).

post-hoc tests with scheffe correction revealed that the execution times of the modules were once again statistically different between the three of them (see table 3 and fig. 3).

table 3 figure 3

in conclusion, the execution times of the three modules under analysis were different in all the conditions tested (i.e. regardless of the document size, for big documents only, and for small documents only), but this difference was greater when considering the estimates made for dealing with big documents only (which can be seen by the f ratios), where the json.parse demonstrated clearly smaller execution times.




database indexes for the inquisitive mind

i used to be a developer advocate for an awesome database product called marklogic, a nosql document database for the enterprise. now people ask me about database stuff pretty frequently.

here i'm going to try to explain some fun stuff you can do with indexes. i'm not going to talk about implementing them, just about what they solve.

the point here is to help you reason about the choices you have when you are implementing stuff to speed up your applications. i'm sure that if you think an idea is smart and fun you'll research the best algorithm to implement it.

if you are curious about marklogic you can always check the inside marklogic white-paper.

range indexes

the most frequent type of index in a database is the range index. range indexes allow you to do really fast order bys, counts, aggregates, etc. let's think about a location index: i can define an index that says if a document contains a json property called local, then add that property's value to a range index called location, treating the value as a string:

index           count            documents
algeria         2                c, d
australia       1                a
canada          5                a, b, c, d, e
portugal        3                a, b, c
togo            5                b, c, d, e, f

this means that documents c and d have local algeria, and so on. now i can ask the database to give me the list of countries by first letter (including frequencies):

a (3)
c (5)
p (3)
t (5)

you can now display this to the user and they can use it to drill down into the content, even if visually it's impossible to display all the options that exist. you could also combine this with other visualizations that could, for example, let a user choose locations in countries starting with a that are ruled by evil dictators. you would just need to add another index on the evil dictators json property.

now, considering this, a user can press the a (3) tab. in the index you can slice out the two matching rows of documents (algeria and australia). now do a merge sort and you get:

a,c,d

meaning documents a, c, and d are located in countries that start with the letter a.
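to make that merge concrete, here's a minimal toy sketch in javascript (just the idea, not how a real database implements it), merging the sorted document lists for algeria (c, d) and australia (a):

function merge_postings(a, b) {
  var out = [], i = 0, j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j])    { out.push(a[i]); i++; j++; } // same doc in both lists
    else if (a[i] < b[j]) { out.push(a[i]); i++; }
    else                  { out.push(b[j]); j++; }
  }
  return out.concat(a.slice(i)).concat(b.slice(j));
}

console.log(merge_postings(['c', 'd'], ['a'])); // [ 'a', 'c', 'd' ]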

the same technique can be used for sorting documents, executing fast aggregates, etc. you can normally keep a bunch of these in memory because they are fairly lean indexes, and if you created them you probably need them! this technique also allows you to do ranges on dates and express stuff like in the last 3 days, or during the eighties (the decade), etc. super cool and super useful.

one thing this index does not give you is the locations that belong to document a. answering that would be the equivalent of a full table scan. so you can also create the index the other way around, meaning associate documents with the locations they have. for this example that would be:

document        count            locations
a               3                australia, canada, portugal
b               3                canada, portugal, togo
c               4                algeria, canada, portugal, togo
d               3                algeria, canada, togo

now it's fairly trivial to list the locations for document a, isn't it? :) so just create this by default when your user asks for a range index on location, and they can have both. :)

the disadvantage of range indexes is that you have to define them to use them, meaning if you forget to create an index and then do an ad-hoc query, performance will suck. or it will time out. it will likely time out if you are doing anything serious with the data. full table scans take time.

inverted index

inverted indexes are what power search engines today, and for me they are one of the most revolutionary things that happened to databases up until now. we all accept that full text search sucks in databases, right?

however, search engines showed us the value behind this structure: give me any text in any form, ask any question against words, and i can answer quickly. for the whole effin internet. yeah!

an inverted index answers questions like find me documents that contain the word blue but not the word black. it's kind of like the index in the back of a book. you can just see which pages the word blue appears on (let's call it set a), and then which pages the word black appears on (set b). what we are looking for is:

a except b

and we can just go on adding constraints. the cool thing about it is that with an inverted index, the more conditions you add the more you diminish the query granularity, which normally translates to less io and cpu, which means faster queries.

an inverted index looks a lot like a hash table. you hash the word and place it in a hash table. then, like in the range index, you keep an array of the documents that match that term. unlike the range index, the inverted index is hashed, thus not ordered. also unlike the range index, the inverted index is not lean: it indexes every single word it finds in a document.

term            term list
red             c
blue            a, b
black           a
run             a
running         b, c

this is a hash table: unordered, and you don't have access to the keys. if you ask for words starting with b this index is useless; you can only find things after running them thru the hash function. however this makes things like stemming super easy. when hashing you can coalesce words like run, running, and ran to the same hash. this means you can tell these words are the same for the purpose of the search directly from the index. actually, if you stem terms before hashing them you lose the ability to distinguish whether the word was run or running.
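to make the "blue but not black" example concrete, here's a minimal toy sketch, with plain objects and arrays standing in for the real hash table and term lists:

var index = { red: ['c'], blue: ['a', 'b'], black: ['a'], run: ['a'], running: ['b', 'c'] };

// documents matching term_a except those matching term_b
function except(term_a, term_b) {
  var exclude = index[term_b] || [];
  return (index[term_a] || []).filter(function (doc) {
    return exclude.indexOf(doc) === -1;
  });
}

console.log(except('blue', 'black')); // [ 'b' ]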

every time you insert a document you need to go thru every word and add it to this index. so if you have a document with 80 thousand words that document triggers 80 thousand updates to the index. this takes time.

since you can't really control the complexity of the indexing algorithm (considering you are a smart guy and implemented the most efficient algorithm for your problem) all you can do is control the n.

in other words, you can have all updates going into a single entry, the giant index, and then you can't give the user guarantees of when it will be ready (other than eventually). or you can have the index partitioned so that you control the n; this way you can parallelize better and give real-time results to your user. the problem with this approach is that query performance degrades with partitioning (e.g. the index for the word blue now exists in multiple partitions) and you need to compact your indexes eventually. the partitioning technique marklogic and other nosql databases use is the lsm-tree, and the versioning and compaction technique at the database level is called multi-version concurrency control.

in marklogic there's actually a fun twist to this: they only do writes in memory and keep the indexes and documents in a buffer (it's double buffered, so when a flush happens you don't have to wait for a new buffer to be ready). writes are journaled to disk so that if a computer crashes, marklogic can recreate the indexes and memory artifacts from the journal. so everything is written to memory, and when the buffer is full it gets flushed to disk. the actual compaction part of the lsm-tree happens on the artifacts that are on disk, not in memory.

inverted indexes are super fun and they should be in the core of any modern database system. they will be, in some time :)

universal index

so with the inverted index i can ask any question that goes against words and get really fast responses to "ad-hoc" queries without doing a full table scan. sweet. but if i need anything that relates to the json structure i'm in trouble: the inverted index only indexes words.

this is where the guys at marklogic invented something super cool called the universal index. the idea is that when you are indexing words you also index the structure of the document. first let me tell you a story so you understand why the universal index uses parent-child associations to store structure.

how would you create an entry in the inverted index to find a phrase?

imagine i'm looking for the phrase "something wrong with" in document a

there's something wrong with me, i'm a cuckoo

if you use a normal inverted index you can find documents that have the word "something", documents that have the word "wrong", and documents that have the word "with". but loads of the documents that have all those terms won't have the phrase. in an ideal world, for this search, you would group terms 3 by 3:

term                        term list
there's something wrong     a
something wrong with        a
wrong with me               a
with me im                  a
me im a                     a
im a cuckoo                 a

now if i did that search i wouldn't find any false positives in the index, which means smaller query granularity, which normally translates to a faster query. however, with this index you can't search for two-word phrases. so the default in most search indexes is to index phrases by grouping words two by two:

term                        term list
there's something           a
something wrong             a
wrong with                  a
with me                     a
me im                       a
im a                        a
a cuckoo                    a

if you search for phrases of more than two words you can still get false positives, but the query granularity is probably much better and it will likely work.
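a minimal sketch of how those two-word groups (shingles) could be produced from the sentence above:

function shingles(text, n) {
  var words = text.toLowerCase().replace(/[^a-z\s']/g, '').split(/\s+/)
    , out   = []
    ;
  for (var i = 0; i + n <= words.length; i++) {
    out.push(words.slice(i, i + n).join(' '));
  }
  return out;
}

console.log(shingles("there's something wrong with me, i'm a cuckoo", 2));
// [ "there's something", 'something wrong', 'wrong with',
//   'with me', "me i'm", "i'm a", 'a cuckoo' ]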

so why all this now? in the universal index you augment the inverted index with the structure of the documents. things like parent-child relationships are stored. things like the value of a property are stored. this augments the inverted index and makes it super useful. e.g. the following document a:

{ "site":
  { "name": "github"
  , "description": "social coding done right"
  }
, "owner": "Pedro"
}

would produce the following index:

term                                                 term list
word:github                                          a
word:social                                          a
word:coding                                          a
word:done                                            a
word:right                                           a
property:name=github                                 a
property:description=social coding done right        a
property:owner=pedro                                 a

and now we can answer super complicated questions like "give me documents that have the word github but not the term red that are owned by pedro and are named github".

marklogic took this to another level by adding security, collections (kind of like gmail labels) and even directory structure using an inverted index. the beauty of it is that the more complicated you make the query the faster it returns.

e.g. if i have 1 billion documents but only 10 are mine, security can see, right from the index, that i can only see those 10 documents. so the maximum io i can do is 10, even in such a large dataset.

conclusion

there's some more fun stuff i could write about, but maybe in another article.

fun stuff like how to store what your users like and have indexes that help you alert at scale, or register queries in your system, or even how map reduce is used to query views in couchdb.

feel free to check the inside marklogic paper; it goes into infinitely more detail than this text.




mock testing couchdb in node.js with nock and tap

one of my first node.js libraries was nano: a no-fuss couchdb client based on the super pervasive request. in hindsight that was a good idea: even though there are a ton of clients for couchdb, none of them is as simple as nano, and any http client that is not based on request is not something i would even consider.

when you are writing an http client you need to test against one (or several) http endpoints. i was lazy about it so i chose to point nano at iriscouch and run the tests on real http requests (i even found a bug in node.js along the way, now fixed in 0.6+). this was a problematic but overall ok approach.

then some weeks ago i started automating the tests using travis, and builds started to fail. to make this work and fix all the shortcomings of connecting to iriscouch i needed an http mocking module.

by the way travis is super cool. you should test all your node.js libraries with it. all you need to do is go to the site, sign in with github and place a .travis.yml file like this one in the root of your lib:

language: "node_js"
node_js:
  - 0.4
  - 0.6

enter nock

pedro teixeira's nock allows you to do http mock testing while preserving the possibility of running the tests against a real http endpoint.

let's start with this small tap test (sudo npm install tap nano nock):

var nano = require('nano')('http://nodejsbug.iriscouch.com') 
var test = require('tap').test;
var db   = nano.use('testing_nock');

test('Insert a Document Into CouchDB', function(t) {
  t.plan(4);
  nano.db.create('testing_nock', function () {
    db.insert({foo: "bar"},
      function ensure_insert_worked_cb(err, doc) {
        t.notOk(err, 'No errors');
        t.ok(doc.ok, 'Contains ok');
        t.ok(doc.rev, 'Rev exists');
        t.ok(doc.id, 'Id exists');
      });
  });
});

if we save this in a file test.js we can run the tests and see they all pass. we can even invoke the script with debugging turned on and inspect the http request/response flow:

$ NANO_ENV=testing node test.js 
{ url: 'http://nodejsbug.iriscouch.com' }
>>
{ method: 'PUT',
  headers: 
   { 'content-type': 'application/json',
     accept: 'application/json' },
  uri: 'http://nodejsbug.iriscouch.com/testing_nock' }
<<
{ err: null,
  body: { ok: true },
  headers: 
   { location: 'http://nodejsbug.iriscouch.com/testing_nock',
     date: 'Thu, 01 Dec 2011 16:42:21 GMT',
     'content-type': 'application/json',
     'cache-control': 'must-revalidate',
     'status-code': 201 } }
>>
{ method: 'POST',
  headers: 
   { 'content-type': 'application/json',
     accept: 'application/json' },
  uri: 'http://nodejsbug.iriscouch.com/testing_nock',
  body: '{"foo":"bar"}' }
<<
{ err: null,
  body: 
   { ok: true,
     id: 'f191a858a66828d8de66b3c974005346',
     rev: '1-4c6114c65e295552ab1019e2b046b10e' },
  headers: 
   { location: 'http://nodejsbug.iriscouch.com/testing_nock/f191a858a66828d8de66b3c974005346',
     date: 'Thu, 01 Dec 2011 16:42:22 GMT',
     'content-type': 'application/json',
     'cache-control': 'must-revalidate',
     'status-code': 201 } }
# Insert a Document Into CouchDB
ok 1 No errors
ok 2 Contains ok
ok 3 Rev exists
ok 4 Id exists

1..4
# tests 4
# pass  4

# ok

so nano gives you a way to actually see all the http traffic that it creates and receives. this is great, but i still need to write code to mock these interactions.

with nock this is super simple:

var nano = require('nano')('http://nodejsbug.iriscouch.com') 
var nock = require('nock'); // we require nock
var test = require('tap').test;
var db   = nano.use('testing_nock');

nock.recorder.rec();

test('Insert a Document Into CouchDB', function(t) {
  t.plan(4);
  nano.db.create('testing_nock', function () {
    db.insert({foo: "bar"},
      function ensure_insert_worked_cb(err, doc) {
        t.notOk(err, 'No errors');
        t.ok(doc.ok, 'Contains ok');
        t.ok(doc.rev, 'Rev exists');
        t.ok(doc.id, 'Id exists');
      });
  });
});

running the tests returns:

$ node test.js 

<<<<<<-- cut here -->>>>>>

nock('nodejsbug.iriscouch.com')
  .put('/testing_nock')
  .reply(412, "{\"error\":\"file_exists\",\"reason\":\"The database could not be created, the file already exists.\"}\n", { server: 'CouchDB/1.1.1 (Erlang OTP/R14B04)',
  date: 'Thu, 01 Dec 2011 17:43:30 GMT',
  'content-type': 'application/json',
  'content-length': '95',
  'cache-control': 'must-revalidate' });

<<<<<<-- cut here -->>>>>>

<<<<<<-- cut here -->>>>>>

nock('nodejsbug.iriscouch.com')
  .post('/testing_nock', "{\"foo\":\"bar\"}")
  .reply(201, "{\"ok\":true,\"id\":\"8b787a6a1c2476ef9a2eed069e000ff0\",\"rev\":\"1-4c6114c65e295552ab1019e2b046b10e\"}\n", { server: 'CouchDB/1.1.1 (Erlang OTP/R14B04)',
  location: 'http://nodejsbug.iriscouch.com/testing_nock/8b787a6a1c2476ef9a2eed069e000ff0',
  date: 'Thu, 01 Dec 2011 17:43:31 GMT',
  'content-type': 'application/json',
  'content-length': '95',
  'cache-control': 'must-revalidate' });

<<<<<<-- cut here -->>>>>>

# Insert a Document Into CouchDB
ok 1 No errors
ok 2 Contains ok
ok 3 Rev exists
ok 4 Id exists

1..4
# tests 4
# pass  4

# ok

so now all we need to do is add these nock http mocks and we are done:

var nano = require('nano')('http://nodejsbug.iriscouch.com') 
var nock = require('nock'); // we require nock
var test = require('tap').test;
var db   = nano.use('testing_nock');

var couch = nock('nodejsbug.iriscouch.com')
  .put('/testing_nock')
  .reply( 412
   , "{ \"error\":\"file_exists\""+
      ", \"reason\":\"The database could not be created, the file" +
      " already exists.\"}\n"
   , { server: 'CouchDB/1.1.1 (Erlang OTP/R14B04)'
   , date: 'Thu, 01 Dec 2011 17:43:30 GMT'
   , 'content-type': 'application/json'
   , 'content-length': '95'
   , 'cache-control': 'must-revalidate' })
  .post('/testing_nock', "{\"foo\":\"bar\"}")
  .reply(201
   , "{ \"ok\":true" +
     ", \"id\":\"8b787a6a1c2476ef9a2eed069e000ff0\"" +
     ", \"rev\":\"1-4c6114c65e295552ab1019e2b046b10e\"}\n"
   , { server: 'CouchDB/1.1.1 (Erlang OTP/R14B04)'
   , location: 'http://nodejsbug.iriscouch.com/testing_nock/'
     + '8b787a6a1c2476ef9a2eed069e000ff0'
   , date: 'Thu, 01 Dec 2011 17:43:31 GMT'
   , 'content-type': 'application/json'
   , 'content-length': '95'
   , 'cache-control': 'must-revalidate' });

test('Insert a Document Into CouchDB', function(t) {
  t.plan(4);
  nano.db.create('testing_nock', function () {
    db.insert({foo: "bar"},
      function ensure_insert_worked_cb(err, doc) {
        t.notOk(err, 'No errors');
        t.ok(doc.ok, 'Contains ok');
        t.ok(doc.rev, 'Rev exists');
        t.ok(doc.id, 'Id exists');
      });
  });
});

all working, happy nocking! :)

$ node test.js 
# Insert a Document Into CouchDB
ok 1 No errors
ok 2 Contains ok
ok 3 Rev exists
ok 4 Id exists

1..4
# tests 4
# pass  4

# ok



couchdb in the browser vs. indexed database api

there's something fundamentally wrong with the way we do browser apps in javascript, and it can be described in a single sentence: you can't use a local database. or a local search engine.

even if you try to build your own abstractions you will never be able to build something that works even minimally well as databases go, e.g. javascript doesn't let you create the custom datatypes that databases are built around for performance, such as btrees.

the indexed db api tries to formulate the lowest common denominator: the things that browser vendors must provide so you can build your own databases using javascript.

this is much better than what we have right now so you would expect a database geek such as myself to feel ecstatic about the indexed db api. i am happy about it but still have some really big concerns about it:

the indexed db api is not a finished product. developers need something that can sync our data to another machine, that can store json, and where we can run some queries. maybe something that gives us push notifications in a flexible way? the indexed db api focuses on none of these things.

the indexed db api specification is heavily relational in nature. one might say i'm being unfair, but whoever says that probably never worked on a document database. you can read about cursors, transactions, and all sorts of things that i would not expect to be in early drafts of a database api for the web. at least i wouldn't.
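
just to make that point concrete, here's a rough sketch of what reading documents back out of the indexed db api looks like (assuming a browser that exposes window.indexedDB unprefixed; this is an illustration, not production code). note the vocabulary: object stores, versioned upgrades, transactions, cursors.

// rough sketch of the indexed db api: object stores, versioned
// upgrades, transactions and cursors
var open = indexedDB.open('people', 1);

open.onupgradeneeded = function () {
  open.result.createObjectStore('person', { keyPath: 'handle' });
};

open.onsuccess = function () {
  var db = open.result;
  var cursorRequest = db.transaction('person', 'readonly')
                        .objectStore('person')
                        .openCursor();

  cursorRequest.onsuccess = function (event) {
    var cursor = event.target.result;
    if (!cursor) { return; }   // no more records
    console.log(cursor.value); // one stored object
    cursor.continue();
  };
};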

the indexed db api is years away from mainstream usage and we need it now! this is my biggest pain point with it: we need to wait for a recommendation, then people will (hopefully) build products. then we need to wait for browsers to catch up, and then we need to wait for users to upgrade their browsers. not easy considering how many people still use ie 6 today.

it looks like we are in the html5 vs. xhtml standoff again. this is taking way too long for something we needed yesterday. i for one think html5 was a great thing that broke free from the endless boring discussions about making the perfect markup language.

i for one would just vote for putting couchdb in the browser and standardizing its http api and replication engine. couchdb is already an apache project and is extremely successful at doing the stuff we need to do in web applications. why reinvent the wheel? take it and build your standard around it, not some relationally biased idb that people will then implement couchdb on top of.

i would love to hear why this is such a terrible idea: base the standard on something that works today and modify it according to what the browser needs.

ps: i have no interest in contributing to confusion or instigating anyone. my sole intention is to have this shit running before i have grey hair and look like jack nicholson. if you are looking for trolling i suggest you look somewhere else.




so you think you can build a document database?

we all know how relational databases work. we all know very well how to solve the problem of squeezing data into tables and getting answers out of it using the old sql dialect.

but what about when we have a document database? how can we allow our documents to remain in their original shape and still get any answer we want using newer database dialects like xquery or javascript? how would you engineer a database for unstructured data?

many have tried. search engines do it by... not being a database! they give away query-time flexibility so they can index massive amounts of textual documents. if you want to do a text search, they're great, but if you want to treat documents like a database - issuing ad-hoc queries that understand the document structure - they can't do it.

other document databases like couchdb create something like serialized views of the data that give you query performance at the cost of ad-hoc queries. others like mongodb allow you to create relational-like indexes on top of documents in a somewhat flexible way by giving up on transactional guarantees. if you want ad-hoc queries and transactional guarantees, you need something else. if you want full-text search you also need something else.

at marklogic we pride ourselves on having a high-throughput, acid-compliant, fast ad-hoc query engine supported by both inverted indexes (which make marklogic look like a search engine) and range indexes (which are more common in relational-land).

marklogic doesn't make compromises. you can issue ad-hoc queries that understand the document structure. you can have transactional guarantees. you can run full-text queries, or database-style value or scalar queries, all in one and with acid guarantees.

in berlin i got the chance to introduce our architecture in a session named "acid transactions at the pb scale with marklogic server". i invite you all to watch it and challenge me with your questions.

if afterwards you wish you knew (even more) about how marklogic works, feel free to check the inside marklogic server white-paper!

happy monday guys!




getting started with nodejs and couchdb

after seeing some questions on stack overflow about getting started with couchdb and nodejs i decided to give answering one of them a go. hopefully this will help other people with similar issues!

let's start by creating a folder and installing some dependencies:

mkdir test && cd test
npm install nano
npm install express

if you have couchdb installed, great. if you don't, you will either need to install it or set up an instance online at iriscouch.com

now create a new file called index.js. inside place the following code:

var express = require('express')
   , nano    = require('nano')('http://localhost:5984')
   , app     = module.exports = express.createServer()
   , db_name = "my_couch"
   , db      = nano.use(db_name);

app.get("/", function(request,response) {
  nano.db.create(db_name, function (error, body, headers) {
    if(error) { return response.send(error.message, error['status-code']); }
    db.insert({foo: true}, "foo", function (error2, body2, headers2) {
      if(error2) { return response.send(error2.message, error2['status-code']); }
      response.send("Insert ok!", 200);
    });
  });
});

app.listen(3333);
console.log("server is running. check expressjs.org for more cool tricks");

if you set up a username and password for your couchdb you need to include them in the url. in the following line i added admin:admin@ to the url as an example

, nano    = require('nano')('http://admin:admin@localhost:5984')

the problem with this script is that it tries to create a database every time you do a request. this will fail as soon as the database has been created for the first time. ideally you want to remove the database creation from the script so it keeps working on every request:

var express = require('express')
   , db    = require('nano')('http://localhost:5984/my_couch')
   , app     = module.exports = express.createServer()
   ;

app.get("/", function(request,response) {
    db.get("foo", function (error, body, headers) {
      if(error) { return response.send(error.message, error['status-code']); }
      response.send(body, 200);
    });
  });
});

app.listen(3333);
console.log("server is running. check expressjs.org for more cool tricks");

you can now either create the database manually, or do it programmatically, as sketched below. if you are curious how you would achieve this you can also read this article i wrote a while back: nano - minimalistic couchdb for node.js.
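
for reference, here's a minimal sketch of the programmatic route using nano (assuming the same localhost couchdb as above; think of it as a one-off setup script, not part of the request handler):

// create the database once; if it already exists couchdb answers
// with an error, which is fine for a setup script
var nano = require('nano')('http://localhost:5984');

nano.db.create('my_couch', function (error) {
  if (error) { return console.log(error.message); }
  console.log('my_couch created');
});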

for more info refer to expressjs and nano. hope this helps!




nano — minimalistic couchdb client for nodejs

in some of my nodejs projects i was using request to connect to couchdb. as mikeal rogers, the author of request, would say, couchdb and nodejs are a perfect fit. i would argue that request is the perfect glue that binds nodejs and couchdb together. request is easy to use, and easy to reason about when you hit a problem.

one of the coolest things about request is that you can even proxy requests from couchdb directly to your end user using the nodejs stream#pipe functionality.

after doing development like this for a while some obvious patterns started to emerge, as well as some code duplication. so the idea of nano was born: build the minimal abstraction possible that allows you to use couchdb from nodejs while preserving stream#pipe capabilities.

the result is a very clean code base based entirely on request.

show me the code

for your convenience i added all the code snippets to a gist.

you can install nano using npm:

mkdir nano_sample && cd nano_sample
npm install nano

if you don't have couchdb installed i would recommend using iris couch. you can sign up in less than a minute and you will have your couchdb up and running.

now we can give nano a try:

node
var nano = require('nano')('http://localhost:5984');
nano;
// { db: 
//  { create: [Function: create_db],
//    get: [Function: get_db],
//    destroy: [Function: destroy_db],
//    list: [Function: list_dbs],
//    use: [Function: document_module],
//    scope: [Function: document_module],
//    compact: [Function: compact_db],
//    replicate: [Function: replicate_db],
//    changes: [Function: changes_db] },
// use: [Function: document_module],
// scope: [Function: document_module],
// request: [Function: relax],
// config: { url: 'http://localhost:5984' },
// relax: [Function: relax],
// dinosaur: [Function: relax] }

one cool thing about nano is that you don't have to learn about errors: they are proxied directly from couchdb. so if you knew them in couchdb, you know them in nano. the only error nano introduces is a socket error, meaning the connection to couchdb failed.

this makes it super easy for someone that knows couchdb to use nano.

one common pattern i see in people developing couchdb-centric applications is lazy creation of databases: you try to insert a document, and if the database doesn't exist you create it and retry. let's see how that would work in nano:

// don't forget to add your credentials if you are not in admin party mode!
var nano = require('nano')('http://localhost:5984');
var db_name = "test";
var db = nano.use(db_name);

function insert_doc(doc, tried) {
  db.insert(doc,
    function (error,http_body,http_headers) {
      if(error) {
        if(error.message === 'no_db_file'  && tried < 1) {
          // create database and retry
          return nano.db.create(db_name, function () {
            insert_doc(doc, tried+1);
          });
        }
        else { return console.log(error); }
      }
      console.log(http_body);
  });
}

insert_doc({nano: true}, 0);

we use nano.use(db_name) to instruct nano to operate on that database. in nano all callbacks return three arguments: 1) errors, 2) the http body returned from couch, 3) the http headers. that's why we can say if(error.message === 'no_db_file' && tried < 1): because we get the error message that was proxied from couchdb. here's a gist with some verbose output from the execution of this code.

if you are an absolute beginner in nodejs there are two things here that might confuse you:

because nano is minimalistic it doesn't try to support every single thing you can do in couchdb. the way nano allows you to extend that functionality is by using the request method:

var nano = require('nano')('http://localhost:5984');
nano.request({db: "_uuids"}, function(_,uuids){ console.log(uuids); });

hello pipe!

let's try to use nano to pipe something from couchdb using express.

npm install express
npm install request

we need something we can pipe out, so let's pipe the nodejs logo into couchdb:

node
// alias for require('nano')('http://localhost:5984').use('test');
var db      = require('nano')('http://localhost:5984/test');
var request = require('request');

// {} for empty body as parameter is required but will be piped in
request.get("http://nodejs.org/logo.png").pipe(
  db.attachment.insert("new", "logo.png", null, "image/png")
);

if you visit futon (i.e. localhost:5984/_utils/) you should be able to see the nodejs logo inside the test database, in document new, in an attachment called logo.png.

what if instead we want to pipe the attachment from couchdb to the end user?

vi index.js
var express = require('express')
  , nano    = require('nano')('http://localhost:5984')
  , app     = module.exports = express.createServer()
  , db_name = "test"
  , db      = nano.use(db_name);

app.get("/", function(request,response) {
  db.attachment.get("new", "logo.png").pipe(response);
});

app.listen(3333);

now go to your browser and visit localhost:3333. you should be able to see the nodejs logo!

hope you had fun following this little experiment — feel free to ask questions in the comments.




why sql sucks for nosql unstructured databases

as some of my readers know i have now worked in two document databases: ibm purexml, a native xml database built on top of a relational engine (pun intended) that offers both relational (sql/xml) and unstructured (xquery) query languages, and marklogic, a database built from scratch on a new database paradigm (call it nosql if you like) that understands unstructured data and offers an unstructured query language (xquery).

another relevant tidbit of information is the emerging trend amongst nosql database vendors to implement sql (or sql-like interfaces). an example would be the recent push on cassandra with cql, or even the more mature hadoop-based sql interfaces. i see this as nosql trying to grow into the enterprise, which overall is a good thing.

i'm not going to argue about whether these nosql vendors are making the right choice with sql, or talk about the fact that enterprise is about more than just bolting on a sql interface. i'm also not going to discuss why some data models lend themselves better to sql than others, e.g. cassandra vs. mongodb (but if you want to discuss those topics just leave a comment).

in this post i'll focus on some lessons learned about mixing the worlds of relational and unstructured databases.

when the two worlds collide

nosql is about no sql. what this means to me is a shift of focus towards non-relational database alternatives that might even explore different interfaces to the database (without caring about being politically correct). that is a good thing! blindly accepting the suckyness of sql for the sake of enterprise? well, even if sql is the right choice for your product, you still need to reason about the consequences and make sure things are well aligned between the two worlds. in other words, it means removing the "blindly" part and reducing the "suckyness" to a bearable minimum for your developers.

but be warned: things will get messy. sql isn't pretty and it's about to collide with the awesome unstructured truck (slightly biased)!

calvin and hobbes

data model

in relational you have:

  RowSet -> SQL -> RowSet

a rowset is something like:

 RowSet -> Item+
 Item   -> INT | VARCHAR n | ...

i'm unaware of a standard data model for json, so i'll talk about a data model i'm fairly familiar with: the xpath data model:

 XDM -> XPath/XQuery -> XDM

and the xdm is something like:

 XDM        -> Item+
 Item       -> AtomicType | Tree
 AtomicType -> integer | string | ...
 ...

(both these definitions are oversimplified but serve the purpose).

one thing that is different about a data model for documents is that trees are not flat:

{
  "namespace": "person-2.0",
  "comments": "This guy asked me for a dinosaur sticker. What a nutter!",
  "person": {
    "handle": "dscape",
    "comments": "Please do not send unsolicited mail."
  }
}

so there are multiple interpretations of what this could mean:

SELECT comments from PERSON where handle = "dscape"

what "comment" element is the query referring to? if you look at sql/xml (which is a terrible, terrible thing) your query would be something like:

SELECT XMLQuery('$person/comments')
FROM PERSON
WHERE XMLExists('$person/person/handle')

which brings me to this obvious conclusion: trees need a way to be navigated. in xml that is xpath; in json maybe that will be jsonselect, maybe something else. but you still need a standard way to navigate first.
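
to make the ambiguity concrete, here's the same document as a plain javascript object (just an illustration, nothing database-specific): the two comments fields can only be told apart by the path you walk to reach them, which is exactly why you need path steps and not just a field name.

var doc = {
  "namespace": "person-2.0",
  "comments": "This guy asked me for a dinosaur sticker. What a nutter!",
  "person": {
    "handle": "dscape",
    "comments": "Please do not send unsolicited mail."
  }
};

console.log(doc.comments);        // the top-level comments
console.log(doc.person.comments); // the comments inside person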

something that makes this challenge even more interesting is schema versioning and evolution. while this has been ignored for ages in the relational world (with serious business implications due to downtime during those fun alter-table moments), it really, really, really can't be ignored for documents. think of microsoft word - how many different document versions do they support? word 2003, 2005, etc.

schema-less, flexible, unstructured: pick your word, but they all lend themselves to quick evolution of data formats. in this query we assume that handle is a child of person, and that the comments about me being an idiot are a direct child of the root. this is bound to change. and sql doesn't support versioning of documents, so you will have to extend it so it does.

a true query language for unstructured data must be version aware. in xquery we can express this query as something like:

declare namespace p = "person-2.0" ;

for $person in collection('person')
let $comments-on-person := $person/p:comments
where $person/p:handle = "dscape"
return $comments-on-person

frankenqueries by example

someone once described sql/xml queries to me as "frankenqueries". the term stuck in my head ever since. let's explore that analogy a little further and look for the places where the organic parts and the bolts come together.

let's imagine two shopping lists, one for joe and one for mary

marys-shopping.json
{ "fruit": {
  "apples": 2
}, "apples": 5 }

joes-shopping.json
{ "fruit": {
  "apples": 6,
  "oranges": 1
} }

now with my "make believe" sql/json-ish extension i do:

SELECT apples FROM LISTS

what does this return? remember rowset goes in, rowset comes out?

2, 5
---
6

so, even though you are clearly asking for a list of quantities of apples, you get two rows instead of three, and one of the rows contains a list of quantities. if you instead decide to return three things, then two rows went in and three rows came out. i'm no mathematician but that doesn't sound good.

once again this is not a problem if you use something that can deal with unstructured information. you don't have this problem in javascript and certainly won't have it in xquery. in both javascript and xquery it's all organic. (or bolts if you prefer)
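
to show what "organic" means here, a quick javascript sketch over the same two shopping lists (an illustration only, not part of any sql/json proposal): nothing forces the answer into rows, so the nested and the top-level apples both come back exactly as they are.

var lists = [
  { fruit: { apples: 2 }, apples: 5 },   // marys-shopping.json
  { fruit: { apples: 6, oranges: 1 } }   // joes-shopping.json
];

var apples = lists.map(function (list) {
  return { nested: list.fruit.apples, topLevel: list.apples };
});

console.log(apples);
// [ { nested: 2, topLevel: 5 }, { nested: 6, topLevel: undefined } ]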

conclusion: the awesome languages for unstructured data, unicorns and pixie-dust!

while xquery is a great language for unstructured information my point here is not advocating for its use. the point i'm trying to make is the need for a real language for unstructured data, whatever you (read: the developers) choose it to be.

but i do ask you (developers) not to accept the "suckyness of sql" back. she's gone and you have this new hot date called nosql. just give it some time and it will grow on you. plus it's lots of fun writing javascript code that runs on databases: don't let them take that away from you.

sql for unstructured data will fail. then pl-sql for unstructured data will fail. so if a vendor is pushing something in your direction don't settle for anything less than a full-fledged programming language: you can write your full app in javascript and store it in a couchapp, or you can write your full app in xquery and store it in marklogic. and it should remain like that!

here's a checklist of things to look for in a query language for unstructured information (feel free to suggest some more):

you can choose to ignore this advice but you might end up feeling like a frustrated silverlight developer. and we, the guys that love to innovate in databases, will feel frustrated that the developers have chosen to accept the suckyness back!

see you at open source bridge

if you want to talk more about this topic i would like to invite you to join me, j chris anderson (couchdb) and roger bodamer (mongodb) at open source bridge in portland this month. we will be hosting a panel-ish un-talk about data modeling in a session called no more joins. so go on, register, and we will see you there!




berlin buzzwords recap

this week i had the pleasure of participating in the second edition of the buzzwords conference in berlin, where marklogic was invited to give a talk called acid transactions at the pb scale with marklogic server.

hallo berlin!

let's start with a nice surprise: the city of berlin. a city filled with great points of interest, energy and a great positive hacker mentality.

some highlights would include a visit to the tempelhof airport, an airport transformed into a park, and the soviet war memorial, a burial ground for soldiers that words won't do justice to. you can go visit c-base, an uber-geek club where the buzzwords bbq was held, and then proceed to the normal touristy attractions like the berlin wall, checkpoint charlie, etc.

if you are looking for cool places to have a chat you can stop by the barn for coffee, grab a meal in the hackescher markt area or even go for a cocktail near görlitzer park. but overall, berlin is amazing and you should visit.

fair warning: you might not want to leave! :)

conference

for me there are two things that make a conference: people and people. maybe that's why i'm naturally more drawn to un-conferences (maybe next year buzzwords will have an un-conference track - who knows!)

in that regard buzzwords was a pleasant surprise. it's been a long time since i've met so many new faces at a single event; plus they are all either search or database geeks. from the surprising openness of ryan betts (voltdb) about their lock-free transactional model and its implications, through dale harvey's insight into couchbase and the future of couchdb, to tim anglade's talk on bigcouch failures and lessons learned, the venue was packed with great people and information. if tim's name is familiar to you that's probably because of his famous bloch and hunter nosql tape.

the overall target audience of this conference was search and database implementors, so the sessions were naturally focused on topics like sharding, scoring and cool things like global idf calculation in a sharded environment. andrzej bialecki's talk was right on the money and was the technical highlight of the conference for me. ok, a technical tie with ryan's talk! :)

talk

if you are curious about the talk you can get some related materials at lanyrd. this includes the slides plus code used in the demos. the session was recorded and will be added here once it's available.

thank you!

the event went extremely well. no problems with the schedule, the internet worked great, the speakers were awesome and very friendly to interact with (hopefully i did ok too), and there was an overall good positive vibe in the room! because organizing an event like this is no easy task, a sincere thank you for such a great event to simon, sylvia, julia, isabel and jan.

so next year don't forget to register and pay berlin a visit!




transactions at the pb scale with marklogic server

talk from berlin buzzwords 2011, you can watch the video here




jsbbq.org and some portuguese poetry

attended jsbbq yesterday. it was fun, mikeal does a great job at organizing stuff.

just finished translating another portuguese poem, this time by fernando pessoa:

 to be great, be whole; 
 exclude nothing, exaggerate nothing that is not you. 
 be whole in everything. put all you are 
 into the smallest thing you do. 
 so, in each lake, the moon shines because it blooms up above.

 ricardo reis, odes



what goes up can't come down?

i would call this my personal summary of the "innovator's dilemma", a book i've recently read.

in a recent conversation someone asked: "so, what goes up and can't come down?" "fuel prices, right?"

while this isn't particularly funny it made me think about why, despite fluctuations in the cost of petrol, gas prices seem to consistently increase. it's easy to give a quick dismissive answer but i kept thinking about it - mostly because i was bored and reading a book would have been considered rude.

in a company like bp or mobil you are indeed expected to "always go up and never go down". and to make the forecast come true you have to focus on opportunities that allow you to reach the projected growth - if you are a 200m/year company and forecast a yearly growth of 10% you have to make an extra 20m to reach those projections. failing to do so in a public company can be catastrophic - even good companies with excellent track records can fail after two successive bad quarters. at the very least it can impact their business, which impacts their ability to remain competitive.

while this might seem perfect to the capitalist in you it really isn't, simply because there is no such thing as perpetual growth. basic intuition tells you that everything that goes up does come down. it might take more or less time, but one thing our world seems to be pretty good at is renewing itself. we all know the higher you go the farther you can fall.

totally made up graph

i've drawn this graphic - it doesn't represent any actual market data and it is indeed completely made up. however it makes it real easy to illustrate the opportunities for growth in a market (diverging stage) and how those opportunities slowly start to fade away (converging stage). don't be fooled my friends, this takes decades to happen even in a fast-changing business like tech.

so what does this mean? regardless of whether the market is growing or shrinking we probably have space for our growth for the next few quarters. if we act effectively it means we can focus on areas that yield more profit by improving our processes and focusing on the things that the customer values. but as the market starts to shrink, opportunities get scarce and competition gets fierce. and companies begin to fail.

one question the curious reader might have in mind is "why does the market go down?". well, renewal happens. using our analogy of a renewing world, a disruptive innovation plays the part of the new. when a new technology is "born" it doesn't solve the old problem, it normally under-performs established technologies, and customers can't understand it or see a need for it. "it can't even run, for god's sake". but as time goes by and the technology substitution happens, emerging markets grow as the old markets start to shrink. if you are a database geek like me you are probably thinking about relational databases and nosql. nosql databases are clunky, can't do transactions or joins, right? but they offer a different kind of value: searching large corpora of unstructured information at scale. and while this isn't exclusive to relational databases, it means that some part of the old relational market will, in the future, be consumed by nosql.

totally made up graph now with disruptive innovation

a pragmatic solution is to prepare for the disruptive innovation. however there is something about innovation that makes this harder: it's unpredictable by nature.

also think about it from our growing company's perspective: it needs to grow 10%. investing in the disruptive technology would mean allocating resources to a less profitable market with no predictability about the outcome. any responsible leader would opt to look for opportunities in markets that allow them to reach the growth figures, and would have to disregard or postpone the disruptive innovation opportunity. and they will do it again. and again. and again. until it's too late.

when new energy alternatives replace oil in mainstream markets and your customers come asking for them, most of the successful oil companies will fail. new firms that focused on what was at the start a small emerging market of alternative energy will be the new energy suppliers of the world. renewal will have happened again.

but can one prevent this? studies suggest that the most efficient approach is to create independent organizations that operate exclusively in the new emerging markets. this way, when the old market and the new market meet in terms of revenue, the independent organization can take over and its revenue stream can keep the parent organization alive while restructuring happens. all without losing your mainstream market leadership.

if you have enough money you can do what ibm, oracle and google do and buy anything that has a pulse, so it can never make you fail.

when looking at teams or companies that are not making a profit or any measurable gain you have to understand that measurement and mathematics only take you so far. the only things that are predictable are the things you already know. innovation, by definition, isn't.

so if you are working in cool tech and aren't just growing boring margins remember this: prepare to fail, prepare to be wrong all the time, respect the laws of nature, iterate fast, and don't make costly strategic decisions based on data you do not have.

if you are working in the predictable growing-margins department remember that the guys next door might look like they don't know what they're doing. that's not because they are unprepared or doing something wrong. it's the nature of their job. if they were just applying the same models as you they would fail miserably and cost you a fortune while at it. so be thankful for innovation and for people willing to take the chance. in the end it's a lot less risky than it seems.




introducing rewrite - url rewriter for marklogic server

the best url rewriter for marklogic server:

<routes>
  <root> dashboard#show </root> 
  <resource name="inbox"> <!-- no users named inbox --> 
    <member action="sent"/> 
  </resource> 
  <resource name=":user"> 
    <constraints>  
      <user type="string" match="^[a-z]([a-z]|[0-9]|_|-)*$"/> 
    </constraints> 
    <member action="followers"/> <!-- no repo named followers --> 
    <resource name=":repo"> 
      <constraints>  
        <repo match="^[a-z]([a-z]|[0-9]|_|-|\.)*$"/> 
      </constraints> 
      <member action="commit/:commit"> 
        <constraints>  
          <commit type="string" match="[a-zA-Z0-9]+"/> 
        </constraints> 
      </member> 
      <member action="tree/:tag" /> 
      <member action="forks" /> 
      <member action="pulls" /> 
      <member action="graphs/impact" /> 
      <member action="graphs/language" /> 
    </resource> 
  </resource>
</routes>



cântico negro (black chant) - josé régio

josé régio is a portuguese poet. this is one of my favorite poems.

"come this way" — some say with sweet eyes
opening their arms, and certain
that it would be good if i would listen
when they say: "come this way"!
i look at them languidly,
(my eyes filled with irony and tiredness)
and i cross my arms,
and i never go that way...
this is my glory:
to create inhumanity!
to accompany no one.
— for i live with the same unwillingness
with which i tore my mother's womb
no, i won't go that way! i only go where
my own steps take me...
if to what i seek to know no one can answer
why do you repeat: "come this way"?

i'd rather crawl thru muddy alleys,
to whirl in the wind,
like rags, to drag my bleeding feet,
than to go that way...
if i came to this world, it was
only to deflower virgin forests, 
and to draw my own footsteps in the unexplored sand! 
all else i do is worth nothing.

how can you be the ones
that give me impulses, tools and courage
to overcome my own obstacles?
the blood of our ancestors runs thru your veins,
and you love what is easy!
i love the far and the mirage,
i love the abysses, the torrents, the deserts...

go! you have roads,
you have gardens, you have flower-beds,
you have a nation, you have roofs,
and you have rules, and treaties, and philosophers, and wise men.
i have my madness!
i hold it high like a torch burning in the dark night,
and i feel foam, and blood, and chants on my lips...
god and the devil guide me, no one else!
everyone's had a father, everyone's had a mother; 
but i, who never begin or end,
was born of the love between god and the devil.

ah! don't give me sympathetic intentions!
don't ask me for definitions!
don't tell me: "come this way"!
my life is a whirlwind that broke loose,
it's a wave that rose.
it's one more atom that ignited... 
i don’t know which way i’ll go,
i don't know where i'm going to,
- i know i'm not going that way!

original version (portuguese):




francesinha recipe

sauce

heat up a little bit of olive oil with chopped onions (almost all portuguese recipes start like this). when the onions are golden add tomatoes (many), 2 bay leaves, beer, one chili pepper (chopped for extra spiciness), a little meat or sausage for flavor, a knorr stock cube (chicken or beef according to what you're cooking in the sandwich), and a splash of a strong drink like whiskey, brandy, port or whatever you think would fit. all sauce recipes are a bit different, because people like different flavors: stronger or milder, spicy or not so spicy. some like to add a little bit of seafood sauce, which gives a slightly different taste but works fine for some people. you have to watch out that it doesn't get too liquid; that's where you can add cornstarch to thicken it, a quick fix if you don't get the thickness right.

sandwich

choose either a chicken, beef or pork steak. any of them is fine and works. grill it or fry it, you choose. reserve it to the side.

get a portuguese linguiça. fry it, maybe with just a little bit of soya sauce, reserve it near the steak.

get a sausage of your preference. fresh or wiener, either is used. fry it, reserve to the side.

grill a shrimp. reserve.

french fries

make some. your favorite kind, but they must hold the sauce.

assembly

get a soup plate. put a slice of bread in the bottom. now put a slice of smoked ham, then cheese, then the steak, then chourição, then the grilled linguiça and sausages.

now close the sandwich with another slice of bread. put the shrimp on the middle of the bread and add cheese slices on top.

now put it in the microwave for 30 to 60 seconds.

once the cheese is melted, pour the sauce on top.

look out for the ingredients. it's better to leave one out than to use a substitute that changes the flavor and might make it bad. and they are hard to find in the states.




marklogic and the universal index

slides from the talk i gave at nosql frankfurt 2010




generate xml from an html form

one of the cool things about xforms is that i can abstract the data model from the form and get a consistent view of my xml. for me this is the killer feature of xforms. however, regular html forms are way more pervasive and i found myself thinking about how i could implement this feature in standard html.

in xforms we have a model (which is xml) and also a form that acts on that model, so our form "knows" the xml structure. in html forms there's no implicit notion of a data model, or anything like that. what is submitted from an html form is a set of key-value pairs.

in this little article we are going to design an application that can insert and search multiple choice questions using html. the html form will be responsible for the insert. the search will be tackled with application builder in part two of this article.

part 1: creating the form

for the sake of this demonstration let's assume 'option_a' is always the correct option, thus avoiding another control. this is ok as we can randomize this list on the server side once we receive the options.

so while in xforms we would submit something like:

<question>
  <text>which of the following twitter users works for marklogic?</text>
  <answer>
    <a>peteaven</a>
    <b>jchris</b>
    <c>stuhood</c>
    <d>antirez</d>
  </answer>
</question>

in regular html you have something like:

> POST / HTTP/1.1
> Content-Type: application/x-www-form-urlencoded
   question=Which of the following twitter users works for MarkLogic?
   &option_a=peteaven&option_b=jchris&option_c=stuhood&option_d=antirez

while this can map perfectly to a relational database it doesn't play well with xml. let me rephrase this: there are multiple ways you could shape it as xml.

one possible solution is to name the fields with an xpath expression and then generate an xml tree out of this path expression.

once we solve this we have two options on how to generate the xml from the xpaths: do some work with a client-side language like javascript and produce the xml that is sent to the server, or simply submit the form and create the xml on the server side with xquery. i chose the second approach for two reasons:

  1. to push the xquery higher-order functions support in marklogic server to the limit and learn how far it can go.
  2. other people might have a similar problem that needs to be solved on the server side. this way they can reuse the code.

higher-order functions are functions that take functions as parameters.

two examples of such functions are fold (a.k.a. reduce or inject) and map (a.k.a. collect or transform).

fold is a list destructor. you give it a list l, a starting value z and a function f. the fold then accumulates the result of applying f to each element of l into z. map is a function that applies a function f to each element of a list.

an example of a fold might be implementing sum, a function that sums the contents of a list:

# in no particular language, pseudo code
sum l = fold (+) 0 l

an example of a map is multiply every element in a list by two:

# in no particular language, pseudo code
double l = map (2*) l

a fold is really just a list destructor, but you can generalize it to any arbitrary algebraic data type. these "generic folds" are called catamorphisms; a fold is just a catamorphism on lists.
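
as an aside, if javascript is more familiar to you than the pseudo code above, the same two examples map onto Array.prototype.reduce (a fold) and Array.prototype.map (just an illustration; the rest of the article sticks to xquery):

// sum via a fold (reduce): start at 0 and accumulate with +
var sum = function (l) {
  return l.reduce(function (z, x) { return z + x; }, 0);
};

// double every element via a map
var double = function (l) {
  return l.map(function (x) { return x * 2; });
};

console.log(sum([1, 2, 3, 4]));  // 10
console.log(double([1, 2, 3]));  // [ 2, 4, 6 ]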

implementing these functions in marklogic xquery 1.0 with recursion is really easy:

declare function local:head( $l ) { $l[1] } ;
declare function local:tail( $l ) { fn:subsequence( $l, 2 ) } ;
declare function local:fold( $f, $z, $l ) { 
  if( fn:empty( $l ) ) then $z
  else local:fold( $f,
                   xdmp:apply( $f, $z, local:head( $l ) ),
                   local:tail( $l ) ) } ;

declare function local:map( $f, $l ) {
  for $e in $l return xdmp:apply( $f, $e ) } ;

declare function local:add($x, $y)         { $x + $y } ;
declare function local:multiply($x, $y)    { $x * $y } ;
declare function local:multiply-by-two($x) { $x * 2 } ;

(: sums a list using fold :)
declare function local:sum( $l ) {
  let $add      := xdmp:function( xs:QName( 'local:add' ) )
  return local:fold( $add, 0, $l ) } ;

declare function local:double ( $l ) {
  let $multiply-by-two := 
    xdmp:function( xs:QName( 'local:multiply-by-two' ) )
  return local:map( $multiply-by-two, $l ) } ;

(: factorial just for fun :)
declare function local:fact($n) { 
  let $multiply := xdmp:function(xs:QName('local:multiply'))
  return local:fold($multiply, 1, 1 to $n) };

(: This is the main part of the XQuery file
 : Illustrating the fold and map from the previous listing :)
<tests>
  <!-- fun facts: http://www.mathematische-basteleien.de/triangularnumber.htm -->
  <sum> { local:sum(1 to 100) } </sum>
  <fact> { local:fact( 10 ) } </fact>
  <double> { local:double( (1 to 5) ) } </double>
</tests>

so how can we use all of this to solve our xpath-to-xml problem? simple. we need to destruct the list of xpaths and generate a tree. in other words, we need to fold the list into a tree.

if we go one level down, an xpath is really a list of steps. once again we need to destruct that list to create each node. so we need a fold inside a fold.

we now need to iterate the list of field values, navigate to the corresponding node using the xpath expression, and finally replace the value of the node (empty at this point) with the value provided in the http form.

scared? wondering if we really need all this functional stuff? fear not, the problem is solved and we will simply use an xquery library module that already exists to solve it! hooray.

the library is called generate-tree and is included in the dxc github project. to get it simply install git and:

git clone git://github.com/dscape/dxc.git

if you don't know what git is (nor care) simply go to the project page at http://github.com/dscape/dxc and download the source.

if you are curious to see the implementation using the folds and everything you learned so far you can check the gen-tree.xqy implementation at github. or as an exercise you can try to do it yourself! to run this code directly from cq i created another script that creates a tree while printing out debug messages. this might be useful to understand how the code runs without getting "lost in recursion".
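
if it helps to see the shape of the problem outside of xquery first, here's a rough javascript sketch of the same paths-in, tree-out idea (the function name and structure are made up for illustration; the real work in this article is done by gen-tree.xqy):

// fold a set of (path, value) pairs into a nested object
function treeFromFields(fields) {
  return Object.keys(fields).reduce(function (tree, path) {
    var steps = path.split('/').filter(Boolean); // '/question/text' -> ['question', 'text']
    var node = tree;
    steps.forEach(function (step, i) {
      if (i === steps.length - 1) { node[step] = fields[path]; }
      else { node = node[step] = node[step] || {}; }
    });
    return tree;
  }, {});
}

console.log(treeFromFields({
  '/question/text': 'which of the following twitter users works for marklogic?',
  '/question/answer/a': 'peteaven'
}));
// -> { question: { text: '...', answer: { a: 'peteaven' } } }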

create a folder called 'questions-form' and place the dxc code there:

njob@ubuntu:~/Desktop/questions-form$ ls -l
total 8
drwxr-xr-x 12 njob njob 4096 2010-08-13 20:51 dxc
-rw-r--r--  1 njob njob  149 2010-08-13 20:59 index.xqy

now we need to create the html form. for now simply create a file called index.xqy inside the 'questions-form' directory and insert the following code:

xquery version '1.0-ml';

"Hello World!"

in this listing we simply print hello world! to get our website online simply go to the marklogic server administration interface at http://localhost:8001 and create a new application server with the following parameters:

name: questions-form
port: <any port that is available in your system>
root: <full path of the directory where you have the index.xqy file>

in my case this will be:

port: 6173
root: /home/njob/Desktop/questions-form

if you have cq installed you can simplify the process by running the following script (remember to change the root. also change the port if necessary)

xquery version '1.0-ml';

import module namespace admin = "http://marklogic.com/xdmp/admin" 
  at "/MarkLogic/admin.xqy" ;

let $name       := "questions-form"
let $root       := "/home/njob/Desktop/questions-form"
let $port       := 6173
let $config     := admin:get-configuration()
let $db         := "Documents"
let $groupid    := admin:group-get-id( $config, "Default" )
let $new        := admin:http-server-create( $config, $groupid, $name, 
  $root, xs:unsignedLong( $port ), 0, xdmp:database( $db ) )
return ( admin:save-configuration( $new ) ,
         <div class="message">
           An HTTP Server called {$name} with root {$root} on 
           port {$port} created successfully </div> )

this is running against the default documents database. this is ok for a demonstration but in a realistic scenario you would be using your own database.

now when you visit http://localhost:6173 you will get a warm hello world!

now let's change the code to actually perform the transformation. to do so simply insert this code in index.xqy. feel free to inspect it and learn from it - i commented it just for that reason.

xquery version '1.0-ml';

(: First we import the library that generates the tree :)
import module namespace mvc = "http://ns.dscape.org/2010/dxc/mvc"
  at "dxc/mvc/mvc.xqy" ;

(: 
 : This function receives a string as the parameter $o
 : which will be either 'a', 'b', 'c' or 'd' and
 : generates an input field for the form
 :)
declare function local:generate-option( $o ) {
 (<br/>, <label for="/question/answer/{$o}">{$o}) </label>,
      <input type="text" name="/question/answer/{$o}" 
        id="/question/answer/{$o}" size="50"/>) };

(: This function simply displays an html form as described in the figures :)
declare function local:display-form() {
  <form name="question_new" method="POST" action="/" id="question_new">
    <label for="/question/text">Question</label><br/>
    &nbsp;&nbsp;&nbsp; <textarea name="/question/text" id="/question/text" 
      rows="2" cols="50">
    Question goes here </textarea>
  <br/>
  { (: using the generate option function button to generate four fields :)
    for $o in ('a','b','c','d') return local:generate-option( $o ) }
  <br/><br/><input type="submit" name="submit" id="submit" value="Submit"/>
   </form> } ;

(: this function will process the insert and display the result
 : for now it simply shows the tree that was generated from the HTML form
 :)
declare function local:display-insert() {
  xdmp:quote( mvc:tree-from-request-fields() ) } ;

(: Now we set the content type to text html so the browser renders
 : the page as HTML as opposed to XML :)
xdmp:set-response-content-type("text/html"),
<html>
  <head>
    <title>New Question</title>
  </head>
  <body> {
  (: if it's a post then the user submited the form :)   
  if( xdmp:get-request-method() = "POST" )
  then local:display-insert()
  else
    (: the user wants to create a new question :)
    local:display-form() }
  </body>
</html>

we are using the 'mvc:tree-from-request-fields()' function to create the tree from the request fields. however this function isn't defined in gen-tree.xqy; it is declared in another library called mvc:

declare function mvc:tree-from-request-fields() {
  let $keys   := xdmp:get-request-field-names() [fn:starts-with(., "/")]
  let $values := for $k in $keys return xdmp:get-request-field($k)
  return gen:process-fields( $keys, $values ) } ;

now you can visit http://localhost:6173 again and you'll see our form. fill it in according to the following picture and click "submit"

this is what the document you inserted looks like:

<?xml version="1.0" encoding="utf-8"?>
<question>
  <text>which of the following twitter users works for marklogic?</text>
  <answer>
    <a>peteaven</a>
    <b>jchris</b>
    <c>stuhood</c>
    <d>antirez</d>
  </answer>
</question>

now let's augment our form with some more interesting fields like author and difficulty. this will help make our search application interesting. simply update the display-form function:

(: This function simply displays an html form as described in the figures :)
declare function local:display-form() {
  <form name="question_new" method="POST" action="/" id="question_new">
    <input type="hidden" name="/question/created-at" 
      id="/question/created-at" value="{fn:current-dateTime()}"/>
    <input type="hidden" name="/question/author" 
      id="/question/author" value="{xdmp:get-current-user()}"/>
    <br/> <label for="/question/difficulty">Difficulty: </label>
      <input type="text" name="/question/difficulty" 
        id="/question/difficulty" size="50"/>
    <br/> <label for="/question/topic">Topic:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
    </label>
      <input type="text" name="/question/topic" 
        id="/question/topic" size="50"/>
    <br/><br/> <label for="/question/text">Question</label><br/>
    &nbsp;&nbsp;&nbsp; <textarea name="/question/text" id="/question/text" 
      rows="2" cols="50">
    Question goes here </textarea>
  <br/>
  { (: using the generate option function button to generate four fields :)
    for $o in ('a','b','c','d') return local:generate-option( $o ) }
  <br/><br/><input type="submit" name="submit" id="submit" value="Submit"/>
   </form> } ;

now we are missing the part where we actually insert the document into the database. for that we need to update the local:display-insert() function:

(: this function will process the insert and display the result
 : it then redirects to / giving you the main page
 :)
declare function local:display-insert() {
  try {
    let $question   := mvc:tree-from-request-fields() (: get tree :)
      let $author     := if ($question//author[1]) 
                         then fn:concat($question//author[1], "/") else ()
      (: now we insert the document :)
      let $_          := xdmp:document-insert(
        (: this fn:concat is generating a uri with directories
         : e.g. /questions/njob/2362427670145529782.xml 
         :)
        fn:concat("/questions/", $author, xdmp:random(), ".xml") , $question )
      return  xdmp:redirect-response("/?flash=Insert+OK")
  } catch ($e) {
    xdmp:redirect-response(fn:concat("/?flash=", 
      fn:encode-for-uri($e//message/text()))) } } ;

so far we talked about the problem and the differences with xforms, went through higher-order functions and how to implement them in xquery, and finally got a working solution for our little problem. coming up next we are going to build an application with application builder to search the questions we can now insert. then we are going to take advantage of the new functionality available in marklogic 4.2 to extend application builder with this form.




getting started with marklogic server

gave this talk at the marklogic user conference, san francisco, may 2010




build a purexml and json application

wrote build a purexml and json application: store and query json with db2 which was published in ibm developer works




rsa encrypt & decrypt in ruby

well, i finished the rsa encryption in ruby some hours ago and felt like sharing

in case you feel like doing something back for me, just download the latest release of my beta twitter client and send me some comments by email. it's pretty hard to test something when my environment is completely contaminated

require 'openssl'
require 'base64'

class Rudolph
  class Crypt
    def initialize data_path
      @data_path = data_path
      @private   = get_key 'id_rsa'
      @public    = get_key 'id_rsa.pub'
    end

    def encrypt_string message
      Base64::encode64(@public.public_encrypt(message)).rstrip
    end

    def decrypt_string message
      @private.private_decrypt Base64::decode64(message)
    end

    def self.generate_keys data_path
      rsa_path = File.join(data_path, 'rsa')
      privkey  = File.join(rsa_path, 'id_rsa')
      pubkey   = File.join(rsa_path, 'id_rsa.pub')
      unless File.exists?(privkey) || File.exists?(pubkey)
        keypair  = OpenSSL::PKey::RSA.generate(1024)
        Dir.mkdir(rsa_path) unless File.exist?(rsa_path)
        File.open(privkey, 'w') { |f| f.write keypair.to_pem } unless File.exists? privkey
        File.open(pubkey, 'w') { |f| f.write keypair.public_key.to_pem } unless File.exists? pubkey
      end
    end

    private
    def get_key filename
      OpenSSL::PKey::RSA.new File.read(File.join(@data_path, 'rsa', filename))
    end
  end
end



rudolph: yet another twitter client

i felt like trying the shoes framework. here are the results so far. when i get some free time i'll try to post some guidelines to do something like this




apache couchdb

it's official: couchdb is an apache project. yay! great news

damien's post




mondrian k-anonymity in ruby

an implementation of this paper: mondrian multidimensional k-anonymity

you can get the code from github




haskell $

i was on the train with joão and i was delighted to see my old friend $. i also miss composition (.) but $ is really the coolest shortcut haskell gives a developer. so what is $?

it’s defined as:

f $ x = f x

what does it do?

Prelude> let f x = map (succ) $ filter ( < 5 ) x
Prelude> f [4,5,7]
[5]
Prelude> let f  = zipWith ($)
Prelude> f [succ,id] [5,4]
[6,4]

in the first example we filter a list for numbers that are less than five and then we apply the succ function to the result. that is, we add one. without $ we would have:

Prelude> let f x = map (succ) (filter ( < 5 ) x)
Prelude> f [4,5,7]
[5]

so we got the parentheses off and that always helps make the code more readable. i simply love this symbol. the second sample is quite a bit more complex, first of all because it is in point-free/point-less notation. zipwith is a function that receives two lists and applies the function provided pair by pair. for example, if i want to add [1,2,3] and [3,2,1] i can:

Prelude> zipWith (+) [1,2,3] [3,2,1]
[4,4,4]

ain't it cool? so in this function we simply apply the functions in the first list (succ and id) to the numbers in the second. looks easy like this, doesn't it? ;) if it doesn't, just read it and digest it and you'll figure it out easily

let's code the same samples in ruby. unfortunately zipwith doesn't exist in ruby (should i commit it? :p) so i'll have to work with another sample using plain zip (it's the same as zipWith (\a b -> [a]++[b]))

irb(main):001:0> [1,2,3].zip([3,2,1])
=> [[1, 3], [2, 2], [3, 1]]

well ruby handles this pretty well without $. we just need to do:

irb(main):002:0> [1,2,3].zip [3,2,1]

because ruby is object oriented this kind of issue doesn't exist, and there are no expressions with large numbers of parentheses either. despite this i must say that the haskell version is far more readable than the ruby one:

irb(main):003:0> [4,5,7].select {
  |i| i < 5
}.map { |i| i.succ }
=> [5]

but i still miss $. i miss coding in haskell. it's just plain fun. i hope i have helped you see why languages like haskell and scheme do matter, and why others like python and ruby can be both useful and fun to work with




couchdb and a new semester

now officially a committer on the couchdb ruby driver! woot




butterfly effect in open-gl

i was asked to deliver some work on chaos theory and my group chose the lorenz attractor as the object of study. it's a really nice chaotic function, as you can learn in the wiki page dedicated to the subject. the work was developed in opengl but i'm pretty sure it would have been easier to do in povray

if you want to give the application a try, the project is hosted at github

you'll need gcc, build-essentials and – who would have thought? – the opengl libraries. it's all explained in a slightly demented readme file




point-free calculator in haskell

point-free is a style of programming in haskell where variables can be omitted. it's pretty useless except for the fact that it makes the code look real cool

we were given the task of:

we had to create an algorithm that would analyze those expressions and try to simplify them to a minimum number of expressions using the rules. this turns out to be np-hard so we had to develop some heuristics to solve it in real time

because this was not hard enough, we also had to define our own recursion schemes on recursive data types we created specifically for the expressions

this is analogous to saying: define mathematical expressions as their own datatype, create recursion methods for it (like reduce, map, etc.) and then, based on mathematical axioms, use a computer to simplify those expressions

describing the work in one word? kick ass! (ok that's two words)

check the result on github




a sudoku game in haskell

a small game and gui for playing sudoku called su-doku-9x9. funny thing, i think sudoku is boring. plus i did brute force because a 9x9 grid is no challenge, ever. there are some cool strategies out there if you have enough time to investigate, though