Automatically extract subtitles from MKV videos on Linux

I’m pretty happy with the built-in media player of my Samsung Smart TV. The UI is lame but effective (it’s really just a simple file browser), and it plays pretty much all the regular movie formats like MKV, Xvid, TS, and the other usual suspects of Usenet. There are two big downsides though: it won’t play DVDs (ISO or VIDEO_TS folders), and it can’t handle subtitles inside MKVs. The latter is quite an inconvenience: you leech a video that has subtitles neatly stored inside the MKV container, but they won’t show up. If you put the subtitles in a separate .SRT file, though, it plays them just fine. So to fix this annoying shortcoming, I threw together a small script that automatically extracts the subtitles from MKV files and saves them to a separate SRT file.

So

WarGames.1983.1080p.x264.mkv

turns into

WarGames.1983.1080p.x264.srt
WarGames.1983.1080p.x264.mkv

I run this as a post-processing script in SABnzbd, but obviously you could hook it up to other leechers as well. Or just run it on a directory manually: it recurses through directories and liberates all subtitles from their MKV containers.

Here are my instructions for running it on Linux (Debian-based). First, install MKVToolNix, a set of tools for working with MKV files:

sudo aptitude install mkvtoolnix
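
Before going further, it may be worth confirming that the two MKVToolNix binaries the script calls are actually on your PATH. A small sketch; `missing_tools` is a helper name I made up for illustration, not something that ships with MKVToolNix:

```shell
#!/bin/bash
# Hypothetical helper: print the name of every given command that is
# not available on PATH, one per line.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" > /dev/null 2>&1 || echo "$tool"
  done
}

# The subtitle script relies on these two binaries from MKVToolNix.
missing=$(missing_tools mkvmerge mkvextract)
if [ -n "$missing" ]; then
  echo "Not installed: $missing" >&2
fi
```

If nothing is printed, both tools were found.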

MKV files can contain multiple subtitle tracks, but I only need the Dutch one myself. So I do a language check and rename all non-Dutch subtitles. There are checks in there for English, German and Spanish as well, so just un-comment the language check you want to have applied.
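
The language check is nothing more than counting the lines that contain a few very common words. Here is that idea on its own, as a rough sketch; the function name `count_dutch_lines` and the sample lines are made up for illustration:

```shell
#!/bin/bash
# Hypothetical helper: count the lines in a file that contain one of three
# common Dutch words (case-insensitive, with a space on both sides).
count_dutch_lines() {
  grep -Eic ' ik | je | een ' "$1"
}

# Made-up sample: two lines contain one of the words, one does not.
sample=$(mktemp)
printf '%s\n' \
  'Dat heb ik niet gedaan.' \
  'Wil je een spelletje spelen?' \
  'Shall we play a game?' > "$sample"

count_dutch_lines "$sample"   # prints 2
rm -f "$sample"
```

Because the pattern requires a space on both sides, a word at the very start or end of a line is not counted. That is one reason the guess is so primitive, and why the script only asks for 10 matching lines in a whole subtitle file.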

#!/bin/bash
# Extract subtitles from each MKV file in the given directory

# If no directory is given, work in local dir
if [ "$1" = "" ]; then
  DIR="."
else
  DIR="$1"
fi

# Get all the MKV files in this dir and its subdirs
find "$DIR" -type f -name '*.mkv' | while IFS= read -r filename
do
  # Find out which tracks contain the subtitles
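  # (anchor on "Track ID" so a file name containing "subtitles" is not matched)
  mkvmerge -i "$filename" | grep '^Track ID [0-9]*: subtitles' | while IFS= read -r subline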
  do
    # Grep the number of the subtitle track
    tracknumber=$(echo "$subline" | grep -Eo '[0-9]+' | head -n 1)

    # Get base name for subtitle
    subtitlename=${filename%.*}

    # Extract the track to a .tmp file
    mkvextract tracks "$filename" "$tracknumber:$subtitlename.srt.tmp" > /dev/null 2>&1
    chmod g+rw "$subtitlename.srt.tmp"

    # Do a super-primitive language guess: DUTCH
    langtest=$(grep -Eic ' ik | je | een ' "$subtitlename.srt.tmp")
    trimregex="vertaling &\|vertaling:\|vertaald door\|bierdopje"

    # Do a super-primitive language guess: ENGLISH
    #langtest=$(grep -Eic ' you | to | the ' "$subtitlename.srt.tmp")
    #trimregex=""

    # Do a super-primitive language guess: GERMAN
    #langtest=$(grep -Eic ' ich | ist | sie ' "$subtitlename.srt.tmp")
    #trimregex=""

    # Do a super-primitive language guess: SPANISH
    #langtest=$(grep -Eic ' el | es | por ' "$subtitlename.srt.tmp")
    #trimregex=""

    # Check if subtitle passes our language filter (10 or more matches)
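    if [ "${langtest:-0}" -ge 10 ]; then
      # Remove credits at the end of the subtitles (see the note at the end of this post):
      # strip CRs, join each subtitle block into one tab-separated line, drop the blocks
      # matching $trimregex (skipped when it is empty), then split back into CRLF lines
      sed 's/\r//g' "$subtitlename.srt.tmp" \
        | awk '{ if (a) printf "\t"; printf "%s", $0; a = 1 } /^$/ { print ""; a = 0 }' \
        | { if [ -n "$trimregex" ]; then grep -iv "$trimregex"; else cat; fi; } \
        | sed 's/\t/\r\n/g' > "$subtitlename.srt"
      rm "$subtitlename.srt.tmp"
      chmod g+rw "$subtitlename.srt"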
    else
      # Not our desired language: add a number to the filename and keep anyway, just in case
      mv "$subtitlename.srt.tmp" "$subtitlename.$tracknumber.srt" > /dev/null 2>&1
    fi
  done
done
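
To make the track lookup in the script a bit less magic: `mkvmerge -i` prints one line per track, and the script simply grabs the number from the lines that mention subtitles. The sketch below does the same on canned input. The helper name `subtitle_track_ids` is made up, and the sample output only approximates what mkvmerge prints (the codec names in parentheses differ between MKVToolNix versions):

```shell
#!/bin/bash
# Hypothetical helper: read `mkvmerge -i` style output on stdin and print
# the ID of every track whose type is "subtitles", one per line.
subtitle_track_ids() {
  sed -n 's/^Track ID \([0-9][0-9]*\): subtitles.*/\1/p'
}

# Made-up sample input, shaped like mkvmerge's track listing.
subtitle_track_ids <<'EOF'
File 'WarGames.1983.1080p.x264.mkv': container: Matroska
Track ID 0: video (MPEG-4p10/AVC/h.264)
Track ID 1: audio (AC-3)
Track ID 2: subtitles (SubRip/SRT)
Track ID 3: subtitles (SubRip/SRT)
EOF
```

For this sample it prints 2 and 3, which are the IDs the script would then hand to mkvextract.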

You can download the script here. Run it like so and your subtitles will automatically be liberated from their MKVs:

$ ./ripsubtitles.sh /home/user/usenet/downloads

One extra note: Dutch subtitlers have a habit of putting their credits or shout-outs in the last few lines of the subtitles. Nothing wrong with that, except when it happens right after the last line spoken in the film. The film might go on for another 5 minutes, and I don’t want the approaching end to be given away by DaNoodleBrain giving shout-outs to BoogerGuzzler, so I remove them with another simple regex.
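
The trick is that the regex has to remove a whole subtitle block (number, timestamp and text), not just the one line that matches. Here is a simplified sketch of how the script does that; `trim_credits` is a name I made up, and unlike the real script this version uses `tr` and does not deal with Windows line endings:

```shell
#!/bin/bash
# Hypothetical helper: read an SRT file on stdin and drop every subtitle
# block that matches the given case-insensitive pattern.
trim_credits() {
  # 1. Join the lines of each block with tabs, so one block = one line.
  # 2. Drop the joined lines that match the pattern.
  # 3. Turn the tabs back into line breaks.
  awk '{ if (a) printf "\t"; printf "%s", $0; a = 1 } /^$/ { print ""; a = 0 }' \
    | grep -iv "$1" \
    | tr '\t' '\n'
}

# Made-up two-block sample: the second block is a translator credit.
trim_credits 'vertaald door' <<'EOF'
1
00:00:01,000 --> 00:00:03,000
Shall we play a game?

2
00:00:04,000 --> 00:00:06,000
Vertaald door DaNoodleBrain

EOF
```

Only the first block survives. One side effect of using tabs as the temporary separator: a tab inside the subtitle text itself would come out as a line break.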

Enjoy, and let me know if you have suggestions on how to improve this little script.
