TBS – Distributing Transcoding

The issue at hand

Recently I’ve worked a lot on adding content to the TBS by parsing the intertubes automagically. For instance, I have a Tumblr and a Twitter parser that allow me to gather data (especially from Egypt). Even if those parsers are stupid, they work.

Another one I wanted to add is the Bambuser one. It’s a streaming service used a lot by people in the Middle East to broadcast coverage of protests. The Bambuser team is great: they already provided us with an API key for the first versions of the TBS. But they mainly use the FLV format for videos.

And I want the TBS to work without Flash, which means HTML5 formats, and there are three of them: OGG (.ogv), WebM (.webm) and MP4 (.mp4). FLV is none of those.

I used to transcode them as celery tasks, right on the TBS, but the Bambuser parser gave me 223 videos to transcode. Given my current configuration, and the CPU power needed to transcode from FLV to OGV (it can actually take more than 4 days per video), I was stuck.

Also, since I don’t have a lot of CPU cores, I had only one celery worker, so the broadcast wasn’t updating itself, which was a shame.

Distribute work

So, the solution is to not transcode those videos myself. And that’s where you can help. I’ve written a little webservice, using a Tastypie RESTful API.

The principle is simple: you ask for a job, download the FLV video from my server, transcode it into one of the three HTML5 video formats, md5sum it, put it somewhere I can retrieve it (a publicly accessible HTTP/HTTPS server is fine) and then send me a PUT with an update.

See? Simple.

So, let’s get into the dirty details.

First, you ask for a job to do by hitting this link: https://broadcast.telecomix.org/tsc/v1/jobs/todo/?format=json

It will answer you with a job to do:

{
  "objects": [
    {
      "id": 399,
      "md5sum": "dce2d12c90cfef2c78b6c5bde98b4c2c",
      "resource_uri": "/tsc/v1/jobs/399/",
      "start_time": "2013-09-18T16:16:32.587953",
      "state": "p",
      "token": "u5d98hOslRQbMJRVtCl6ocLzX5xeCFbneij75Y8j",
      "uri": "https://broadcast.telecomix.org//media//8695.flv"
    }
  ]
}

id: the ID of the job.
md5sum: the checksum of the file you need to transcode.
resource_uri: the URI you can use to check the details of the job (append it to https://broadcast.telecomix.org). It’s also where you’re going to PUT stuff after you’ve done the job.
start_time: the time at which the job was created. Usually, you should get the oldest one to do.
state: the current state of the job. It’s p in this case, because the job is in Progress (since you’re going to do it).
token: the token associated with this job ID, and it’s how I’ll fight spam. If you don’t have both the job ID and the token, then you can’t PUT anything.
uri: the absolute URI of the file I need you to transcode. Just GET this file.
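To make the first step concrete, here is a minimal sketch in Python, using only the standard library. The function names (fetch_job, md5_of_file) are mine, not part of any official client; only the /todo URL and the job fields come from the description above.

```python
import hashlib
import json
import urllib.request

TODO_URL = "https://broadcast.telecomix.org/tsc/v1/jobs/todo/?format=json"

def fetch_job():
    """Ask the TBS for a job; returns the first job dict, or None if the
    'objects' list is empty."""
    with urllib.request.urlopen(TODO_URL) as resp:
        payload = json.load(resp)
    jobs = payload.get("objects", [])
    return jobs[0] if jobs else None

def download(url, dest):
    """GET the FLV file pointed to by the job's 'uri' field."""
    urllib.request.urlretrieve(url, dest)

def md5_of_file(path, chunk_size=1 << 20):
    """Hex md5 digest of a file, read in 1 MiB chunks so that large
    videos don't have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

After downloading, compare md5_of_file("8695.flv") against the job’s md5sum field to make sure the transfer wasn’t corrupted before you spend CPU time transcoding it.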

And that’s all. You can now transcode the file. For the sake of giving an example, I generally use ffmpeg and invoke it like this:

ffmpeg -i input_file.flv output_file.ogv

It’s enough, but if you’re an ffmpeg guru, you can probably find better ways. I try to stay as close as possible to the original format (in size especially), but 320×240 should be enough if you really need a fixed size.

I tend to prefer OGV over WebM and MP4, since it’s the freest codec of the three, but do what you think is best: I can manage all three of them.
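If you drive ffmpeg from a script, a thin wrapper like this keeps failures visible, so a broken transcode doesn’t get reported back as done. This is just a sketch around the basic invocation above; the helper names are mine, and you should tune the ffmpeg flags yourself if you know better ones.

```python
import subprocess

def ffmpeg_cmd(infile, outfile):
    # The basic invocation from the post; ffmpeg picks the output codec
    # from the extension (.ogv, .webm or .mp4).
    return ["ffmpeg", "-i", infile, outfile]

def transcode(infile, outfile):
    # check=True raises CalledProcessError if ffmpeg exits non-zero,
    # so failed jobs never reach the PUT step.
    subprocess.run(ffmpeg_cmd(infile, outfile), check=True)
```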

Once you’re done, send me a PUT on the resource_uri using only three args.

Technically, add the Content-Type: application/json header to your query. And the body needs to be JSON-formatted content, with only these three fields:

{
  "md5sum": "The md5 hexdigest hash of your transcoded file",
  "token": "The token associated with the job",
  "uri": "The URL where I can get the file you transcoded"
}

Any other field will lead to an error.
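The PUT step can be sketched like this, again with only the standard library. The function names are mine; the resource URI, the header and the three fields are exactly the ones described above.

```python
import json
import urllib.request

BASE = "https://broadcast.telecomix.org"

def build_put(resource_uri, md5sum, token, uri):
    """Build the PUT request: exactly the three allowed JSON fields,
    with the Content-Type: application/json header set."""
    body = json.dumps({"md5sum": md5sum, "token": token, "uri": uri}).encode()
    return urllib.request.Request(
        BASE + resource_uri,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )

def report_done(resource_uri, md5sum, token, uri):
    """Send the update and return the HTTP status code."""
    with urllib.request.urlopen(build_put(resource_uri, md5sum, token, uri)) as resp:
        return resp.status
```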

Once I get your PUT request, I’m going to GET your file. It would be nice to serve it with the Content-Type header associated with the file. In fact, if it’s not one of video/ogg, video/webm or video/mp4, I’ll drop the file and reinitialise the job for someone else to do. So, please, set up your webserver accordingly.

And once it’s done, you can get back to /todo and start another job.

If no more jobs are available, you’ll get a 404. Then wait for some time (hours or days) for new jobs to transcode.
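Tying it together, a polling loop with a growing back-off keeps clients from hammering the server when the queue is empty. The delays here (one hour, doubling, capped at a day) are my own guess at something polite, not a requirement of the service; fetch_job and process_job stand in for whatever fetching and transcoding code you wrote.

```python
import time

def backoff_delay(empty_polls, base=3600, cap=86400):
    """Seconds to wait after the Nth consecutive empty poll:
    1 h, 2 h, 4 h, ... capped at one day."""
    return min(base * (2 ** empty_polls), cap)

def work_forever(fetch_job, process_job):
    """Keep asking for jobs; back off while the queue answers 404/empty."""
    empty = 0
    while True:
        job = fetch_job()  # expected to return None when no jobs are left
        if job is None:
            time.sleep(backoff_delay(empty))
            empty += 1
            continue
        empty = 0
        process_job(job)
```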

And a wild client appears

I was working with CapsLock at night to bootstrap a client to automagically do all this stuff.

You’ll need ffmpeg (it seems you need a more recent version than the one in Debian) and some basic Python tools to run it.

Then just:

git clone https://git.legeox.net/capslock/tbs-client.git

And then run it with Python in the classical fashion.

Neat, isn’t it? Now, you have no excuse for not helping to transcode the datalove.

If you have any questions, just ping me.

Thanks for your help, your cores and your bandwidth. Datalove upon you.

— UPDATE [2013/09/21]: One of the fields needed for the PUT (namely hash) was wrong.
— UPDATE 2 [2013/09/21]: Added the git repo for the client.