Orchestrating the execution of workflows for media streaming service and even more

Shuen-Huei (Drake) Guan
sr. principal engineer, KKBOX; vice chairperson, PyCon APAC 2015


Who am I?

• administrator, Ptt BBS

• technical director / R&D manager, Digimax

• team player, KKBOX

• contributor, PyCon Taiwan

This is more of a story than a tech talk. No KKBOX trade secrets are revealed.

Only a few slides talk about Python.

And, it's not about music streaming.

350 team players serving 10M users across 6 countries

20M songs

Events

KKTIX, the always lovely sponsor!

If we can make music streaming work, how about video streaming?
-- KKBOX CxO

Let's work on a video-on-demand service

• Adaptive streaming.

• DRM protection.

• Video processing on cloud.

We thought video streaming was similar to music streaming, but we were wrong.

Issue 1. Workflow

multiple distinct interconnected steps that need to be executed in a particular order in a distributed environment...
-- someone

flickr:siddhu2020 http://bit.ly/1FAukT2

Sample encoding workflow for music

def run(source, secret_key, cipher):
    # verify if the source is ok.
    if not verify(source):
        return False

    # convert audio with different bitrates
    _ = [convert(source, i) for i in range(4)]

    # update id3 tag for all converted audios
    _ = update_id3_tag(_)

    # encrypt all audios
    _ = encrypt(_, secret_key, cipher)

    # deploy to backend DB
    deploy(_)

    return True

Issue 2. Distribute tasks to the cloud, and use the cloud efficiently!

Gearman

Sample encoding workflow for music

Sample client code to submit a workflow [1]

$workflow = new Gearman_Workflow('KKBOX_Convert_Audio', array(
    'source' => $source,
    'args' => $args,
));

$workflow->attachCallback(function () {});

$client->run($workflow);

[1] Warning: it's PHP.

Sample worker (server) code to do things [1]

class KKBOX_Convert_Audio extends Gearman_Worker {
    public function run($arg) {
        // check the source
        if (!verify()) return;
        // convert audio with different bitrates
        for ($i = 0; $i < 4; $i++) {
            convert($i);
        }
        // update id3 tag for all audios
        update_id3_tag();
        // encrypt audios
        encrypt();
        // sequentially deploy to backend DB
        for ($i = 0; $i < 4; $i++) {
            deploy($i);
        }
    }
}

[1] Warning: it's PHP.

Sample encoding workflow for video, a little bit complicated

Sample worker (server) code to do things [1]

class KKBOX_Encode_Video extends Gearman_Worker {
    public function run($arg) {
        transcode();
        encrypt();
    }
}

class KKBOX_Convert_Video extends Gearman_Worker {
    public function run($arg) {
        if (!verify()) return;

        // create asynchronous sub-workflows
        $result = create_sub_workflow(KKBOX_Encode_Video);
        // wait for all sub-workflows to finish
        joint($result);

        create_sub_workflow(KKBOX_Package_DASH, $result->encrypted);
        create_sub_workflow(KKBOX_Package_HLS, $result->plain);
        joint();

        deploy();
    }
}

[1] Warning: it's PHP.
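The fork/join shape of the video workflow above can be sketched in Python as follows. This is only an illustration using the standard library's thread pool, not the Gearman implementation; `encode_video`, `package`, and the profile names are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor


def encode_video(source, profile):
    # Hypothetical stand-in for the transcode + encrypt sub-workflow.
    return {
        "plain": "%s.%s.mp4" % (source, profile),
        "encrypted": "%s.%s.enc.mp4" % (source, profile),
    }


def package(kind, files):
    # Hypothetical stand-in for a DASH/HLS packaging sub-workflow.
    return "%s:%d renditions" % (kind, len(files))


def convert_video(source, profiles):
    with ThreadPoolExecutor() as pool:
        # Fork: one encoding sub-workflow per profile.
        # Join: list() waits until every result is available.
        encoded = list(pool.map(lambda p: encode_video(source, p), profiles))

        # Fork again: packaging starts only after the join above.
        dash = pool.submit(package, "DASH", [e["encrypted"] for e in encoded])
        hls = pool.submit(package, "HLS", [e["plain"] for e in encoded])

        # Join: result() blocks until each packaging step finishes.
        return [dash.result(), hls.result()]


if __name__ == "__main__":
    print(convert_video("source", ["360p", "720p", "1080p"]))
```

Even in this toy form, the ordering logic (what may run in parallel, what must wait) is tangled up with the code that does the work.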

The real Gearman worker code is far more complicated and lacks the elegance we would like to have

Issue 3. Workflows would evolve...

• Let's save file size and IO.

• Let's make it faster.

• Let's add some more profiles.

• Let's fix some encoding.

Everything fails all the time.
-- Werner Vogels, CTO of Amazon

flickr:Bill Abbott http://bit.ly/1GnrSGr

Issue 4. Gearman server down!

Factors we want to pay close attention to

• Encoding workflow

• Distributing tasks across machines in the cloud.

• Server maintenance.

We hope ...

1. no need to maintain this system;

2. easier distribution of workflows/tasks, even to a local machine;

3. a high-level workflow description. As long as you can draw your processes on paper, you can map them to a workflow!

What Google suggests to us...

• Apache Kafka, Mesos, ...

• Gearman (sorry, but we've tried.)

• Luigi by Spotify

• Celery

• Potentially all message brokers with some additional work.

AWS Simple Workflow (SWF)

import boto.swf.layer2 as swf  # assumed: the classes used here match boto's SWF layer2 API

class HelloWorker(swf.ActivityWorker):

    domain = DOMAIN
    version = VERSION
    task_list = TASKLIST

    def run(self):
        activity_task = self.poll()
        if 'activityId' in activity_task:
            print('Hello, World!')
            self.complete()
        return True

class HelloDecider(swf.Decider):

    domain = DOMAIN
    task_list = TASKLIST
    version = VERSION

    def run(self):
        history = self.poll()
        if 'events' in history:
            # Find workflow events not related to decision scheduling.
            workflow_events = [e for e in history['events']
                               if not e['eventType'].startswith('Decision')]
            last_event = workflow_events[-1]

            decisions = swf.Layer1Decisions()
            if last_event['eventType'] == 'WorkflowExecutionStarted':
                decisions.schedule_activity_task(...)
            elif last_event['eventType'] == 'ActivityTaskCompleted':
                decisions.complete_workflow_execution()
            self.complete(decisions=decisions)
        return True

SWF

• Decider defines the workflow.

• We still need to write workflow logic in decider.

• Workers do the action.

• Every time we change a workflow or an action, we need to redeploy the deciders and workers.

Let's decouple the workflow and actions from SWF
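One way to picture the decoupling: the decider becomes a generic function, and the workflow itself is plain data handed to it. The sketch below is a simplification under stated assumptions; the step names and the flat `activity` field on events are hypothetical, and real SWF history events are more deeply nested.

```python
def next_step(steps, events):
    """Pick the next activity from plain data instead of hard-coded branches."""
    completed = {e["activity"] for e in events
                 if e["eventType"] == "ActivityTaskCompleted"}
    for step in steps:
        if step not in completed:
            return step
    return None  # every step finished: the workflow can be completed


# The workflow is now data, so changing it needs no decider redeploy.
workflow = ["verify", "transcode", "package", "deploy"]  # hypothetical step names

history = [
    {"eventType": "WorkflowExecutionStarted"},
    {"eventType": "ActivityTaskCompleted", "activity": "verify"},
]
print(next_step(workflow, history))
```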

Job script for a workflow

Job {KKBOX Convert Video} -subtasks {
    Task {Source Inspection} -cmds {
        Cmd { emilia verify -i s3://bucket/source.mp4 }
    }

    Task {Transcode} --parallel -subtasks {
        Iterate i -from 0 -to 4 -by 1 -template {
            Task {Transcode Audio} -cmds {
                Cmd { ffmpeg -i s3://bucket/source.mp4 -o /tmp/converted_$i.mp4 }
            }
        }
        Iterate i -from 0 -to 8 -by 1 -template {
            Task {Transcode Video} -cmds {
                Cmd { ffmpeg -i s3://bucket/source.mp4 -o /tmp/converted_$i.mp4 }
            }
        }
    }

    Task {Adaptive} -subtasks {
        Task {DASH} -subtasks { }
        Task {HLS} -subtasks { }
        Task {MSS} -subtasks { }
    }
}

What exactly is a job script?


Make it pythonic if that makes developers happier

source = 's3://bucket/source.mp4'

with Job():
    with Task('Source Inspection'):
        Cmd('emilia verify -i %s' % source)

    with Task('Transcode', parallel=True):
        for i in range(4):
            with Task():
                Cmd('ffmpeg -i %s ... -o /tmp/a_%d.mp4' % (source, i))
        for i in range(9):
            with Task():
                Cmd('ffmpeg -i %s ... -o /tmp/v_%d.mp4' % (source, i))

    with Task('Adaptive'):
        with Task('DASH'):
            pass
        with Task('HLS'):
            pass
        with Task('MSS'):
            pass
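How could nested `with` blocks turn into a job tree? A minimal sketch, assuming each node registers itself with the innermost open node. The `Node` base class and its `_stack` attribute are hypothetical and say nothing about how the real system is built; nothing is executed here, the tree is only recorded.

```python
class Node:
    """One node of the job tree; used as a context manager to nest children."""

    _stack = []  # currently open nodes, innermost last

    def __init__(self, title="", parallel=False):
        self.title = title
        self.parallel = parallel
        self.children = []
        self.cmds = []
        # Attach to whichever node is currently open, if any.
        if Node._stack:
            Node._stack[-1].children.append(self)

    def __enter__(self):
        Node._stack.append(self)
        return self

    def __exit__(self, exc_type, exc, tb):
        Node._stack.pop()
        return False  # never swallow exceptions


class Job(Node):
    pass


class Task(Node):
    pass


def Cmd(command):
    """Record a shell command on the innermost open node."""
    Node._stack[-1].cmds.append(command)


source = "s3://bucket/source.mp4"

with Job("KKBOX Convert Video") as job:
    with Task("Source Inspection"):
        Cmd("emilia verify -i %s" % source)
    with Task("Transcode", parallel=True):
        for i in range(4):
            with Task("Transcode Audio %d" % i):
                Cmd("ffmpeg -i %s ... -o /tmp/a_%d.mp4" % (source, i))

print([task.title for task in job.children])
```

Because the script only builds a data structure, the same tree could later be submitted to SWF, run locally, or drawn in a viewer.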

Status

• 1,500,000 minutes of video encoded.

• 3,000 videos per day (max).

• 800 workers on 100 c3.8xlarge instances (max).

• spent lots of $.

• everyone is really happy with that performance.

Technical status

• Fault tolerance by retry. [decider]

• Workflow/task has priorities. [SWF]

• try..except..finally mechanism. [-whendone, -whenerror, -precmds, -postcmds, ...]
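Retry plus the hook options can be pictured with ordinary Python control flow. The sketch below is an assumption about the semantics, not the real implementation: here `precmds` are re-run on every attempt, `postcmds` run after every attempt, and `run_task`, `shell`, and the `runner` parameter are hypothetical names.

```python
import subprocess


def shell(cmd):
    """Default runner: execute one command, raising on a non-zero exit."""
    subprocess.run(cmd, shell=True, check=True)


def run_task(cmds, precmds=(), postcmds=(), whendone=None, whenerror=None,
             retries=2, runner=shell):
    """Run cmds with retries; the hooks mirror try/except/else/finally."""
    for attempt in range(retries + 1):
        try:
            for cmd in list(precmds) + list(cmds):
                runner(cmd)
        except Exception as error:
            # Out of retries: report the failure and give up.
            if attempt == retries:
                if whenerror:
                    whenerror(error)
                return False
        else:
            if whendone:
                whendone()
            return True
        finally:
            # postcmds run after every attempt, successful or not.
            for cmd in postcmds:
                runner(cmd)
    return False


if __name__ == "__main__":
    print(run_task(["echo transcode"], postcmds=["echo cleanup"]))
```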

Question: Are you interested in this project?

To do:

• Use JSON or YAML for job script.

• A viewer to see the progress of workflows!

• Replace SWF with Apache Mesos or Mistral.
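For the first to-do item, a JSON job script might look something like the following. The key names (`title`, `subtasks`, `cmds`) are purely illustrative, since no format has been decided; the walker simply shows that such a file is easy to load and traverse.

```python
import json

# Hypothetical JSON layout for a job script; the key names are illustrative.
JOB_SCRIPT = """
{
  "title": "KKBOX Convert Video",
  "subtasks": [
    {"title": "Source Inspection",
     "cmds": ["emilia verify -i s3://bucket/source.mp4"]},
    {"title": "Adaptive",
     "subtasks": [{"title": "DASH"}, {"title": "HLS"}, {"title": "MSS"}]}
  ]
}
"""


def titles(node, depth=0):
    """Yield (depth, title) for every task in the tree, depth first."""
    yield depth, node["title"]
    for child in node.get("subtasks", []):
        yield from titles(child, depth + 1)


job = json.loads(JOB_SCRIPT)
for depth, title in titles(job):
    print("  " * depth + title)
```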

Thank You! @drakeguan