Friday 16 January 2009

New Year's Resolution: Massive Music Tag Cleanup

Once again I find that months have passed since my last entry. The blog will be a year old in little over a week and I will once again be attending linux.conf.au, this time down in Hobart. I've got myself some new gadgets - in particular an Eee PC 701SD which only cost $327 AU from JB Hifi, so I have a decent computer for the conference. I'll be posting a lot more about it in the coming weeks, but am just mentioning it now as it is linked to today's post. Allow me to explain - while I've kept the default Xandros install on the internal 8 gig solid state drive, I've installed Debian on a 2 gig SD card. 2 gig. Yep. Small, isn't it? And encrypted, but that's for another post. The point is that I've been looking for lightweight alternatives to all the software that I traditionally use in my day to day tasks, so while I'll happily leave Amarok alone on Xandros, I didn't really want to pull in all the KDE dependencies to have it on Debian, and I've come across a nice little ncurses music player called cmus to use instead.

Now, on my desktop and main laptop I use Amarok pretty much exclusively and have tried to keep all the tags in my music collection accurate - I try to check the track listing, the genre, the year and that the capitalisation complies with English capitalisation rules (except when it is apparent that the odd capitalisation is a conscious decision on the part of the artist and forms part of the art). I'm well aware that I've missed some - some of the artists that have been in my collection for longer still have bad capitalisation and I've only started to check the accuracy of the album years recently.

But there is a larger problem - Amarok doesn't reveal every tag to me. While that doesn't matter in the least as long as I'm only using Amarok, it can matter when I use other media players. I'm not worried about any of those albums I own a physical copy of - they're all in ogg (but if you do need a powerful ogg tag editor, tagtool's advanced mode _looks_ promising), but rather the music I've downloaded and have left in mp3. I've been aware of the issue for a while because I occasionally observe some of the symptoms on the various media players available on the Internet Tablet. I have looked at dedicated tag editors, but until now I haven't been able to find one that would show me *every* tag - not just the ones it's programmed to recognise, not just the id3v2.3 tags, but all of them. And not just the first 30 characters of them either.

Why is this *so* important? They're just extra tags, right? Well, my biggest annoyance is that cmus uses the contents of the TPE2 tag, if it is present, for the Artist in its library view rather than the TPE1 tag which Amarok uses. TPE1 is defined as "Lead performer(s)/Soloist(s)", while TPE2 is defined as "Band/orchestra/accompaniment". Now, the TPE2 tag may well be perfectly valid and correct, but it is not a tag that I have been organising or validating so far with Amarok, so I'd like to get everything consistent and delete the TPE2 tags. While I'm at it, why not remove all the cover art from the mp3s - I've always felt it wasteful to keep 12 copies of the same image when I could and do just put a single image in the same folder. In fact, why not go and remove all the tags that aren't recognised by Amarok - do I really care that it was encoded with lame? I might be happy to leave the 'free download from http://www.last.fm' comment tags alone and I certainly don't want to destroy any comments that I've added, but do I really want any of the other comment tags in there?

So I finally found an id3 tag editing tool that can show me most of the tags - eyeD3. It's still not perfect - there isn't any support for id3v2.2, it doesn't show me the tags that replaygain uses and it did crash while parsing some of the mp3s - I dare say I'll have to come back to those later with another tool, even if it is hexedit. Edit: As the author pointed out, eyeD3 is in fact able to read id3v2.2 tags, just not write them, and those crashes will doubtless be solved in no time.

The first step was to find out what tags are actually present in my collection:

find music -iname "*.mp3" -exec eyeD3 -v {} \; | tee index
sort -u index | awk -F\): '/^<.*$/ {print $1}' | uniq | awk -F\)\> '{print $1}' | awk -F\( '{print $(NF)}' > tags
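
To see how often each tag turns up rather than just which ones exist, the same extraction can be wrapped in a small function. This is only a sketch: count_frames is a name made up here, and the "<description (XXXX): value>" line shape is inferred from the awk pipeline above rather than taken from the eyeD3 documentation.

```shell
# Print each frame ID found on stdin together with how many times it occurs,
# most frequent first. Assumes verbose eyeD3 lines look like
# "<description (XXXX): value>" (an assumption based on the pipeline above).
count_frames() {
    awk -F'[)]:' '/^</ { n = split($1, parts, "[(]"); print parts[n] }' |
        sort | uniq -c | sort -rn
}
```

With the index file from the first command, usage would be: count_frames < index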

So, that gives me a list of all the different types of tags in my collection - 44 unique tags in my case. Next step is to work out which ones are used by Amarok and whether I want to keep any of the others. While I could speculate about the three tags I can immediately see that might be a year, it's probably a better idea to look at the source code.

apt-get source amarok libtag1c2a
view amarok-1.4.9.1/amarok/src/metabundle.cpp
view taglib-1.4/taglib/mpeg/id3v2/id3v2tag.cpp

Some tags are immediately obvious because the source names their identifiers directly: TPOS (Disc number), TBPM (beats per minute), TCOM (Composer - admittedly this is one tag that I have not been validating), TPE2 (which is marked as a non-standard MS/Apple extension - so it is aware of it, but since it's messing up my collection and Amarok doesn't seem to display it anywhere I'm getting rid of it anyway) and TCMP (Compilation album, ie, show under various artists. Unfortunately cmus doesn't appear to use this tag, though it does seem to have some logic for compilation albums - this is a matter I will need to investigate further later on).
Digging deeper to look past the nice friendly names that the programmers can recognise to the harsh id3 reality I also identify that I'll need to keep title (TIT2), artist (TPE1), album (TALB), comment (COMM), genre (TCON), year (TDRC) and track (TRCK) - as well as anything that is used when playing the file that isn't identified here.
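
A quick way to get a first list of candidates out of those source files is to grep for quoted four-character identifiers. This is a rough sketch, not how I actually went through the code: frame_ids_in_source is a made-up name, and it assumes the frame IDs appear as quoted string literals, so any other four-character upper case literal will show up too and the result still needs checking by eye.

```shell
# Print the distinct quoted four-character upper case identifiers (such as
# "TIT2") found in the given source files. Uses grep -o, which GNU and BSD
# grep provide but POSIX does not require.
frame_ids_in_source() {
    grep -ohE '"[A-Z][A-Z0-9]{3}"' "$@" | tr -d '"' | sort -u
}
```

For example: frame_ids_in_source taglib-1.4/taglib/mpeg/id3v2/id3v2tag.cpp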

Though Amarok can use images embedded in the mp3s, I don't want any - I much prefer to use Amarok's cover manager combined with copycover-offline.py to copy them into the appropriate directory (look through the comments for useful patches - hmmm, should probably submit my fix for albums with Various Artists come to think of it).

So, I made a list of these tags, one per line, in a file called amaroktags. Then I found all the tags in my collection that aren't supported by Amarok:

cat amaroktags tags | sort | uniq -u
view taglib-1.4/taglib/mpeg/id3v2/id3v2.4.0-frames.txt
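
One thing to be aware of with sort | uniq -u is that it prints anything that appears in only one of the two files, so a tag listed in amaroktags that never occurs in the collection shows up as well. If only the "in my collection but not known to Amarok" direction is wanted, a one-sided comparison along these lines might be cleaner (unwanted_tags is just an illustrative name):

```shell
# List frame IDs that are in the collection but not in the keep list.
# $1: keep list, one ID per line; $2: IDs found in the collection.
# -x matches whole lines, -F treats the patterns as fixed strings,
# -v inverts the match, -f reads the patterns from the keep list.
unwanted_tags() {
    grep -vxFf "$1" "$2"
}
```

Usage would be: unwanted_tags amaroktags tags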


Which left me with a list of tags that I wanted to keep:
COMM, TALB, TBPM, TCMP, TCOM, TCON, TDRC, TIT2, TPE1, TPOS, TRCK, MCDI (Music CD Identifier), TFLT (File type), TLEN (length, used for seeking), TSRC (International Standard Recording Code - the only album using it in my collection is Nine Inch Nails' Ghosts I-IV)

And an even larger list of tags to zap:
TPE2, APIC (Attached picture), TDTG (Tagging time), GEOB (arbitrary file), PCNT (Play count), POPM (Popularimeter), PRIV (private textual & binary data), TCOP (copyright), TDEN (encoding timestamp), TENC (Encoded by), TIT1 (content group description), TIT3 (Description refinement), TLAN (language), TMED (Media type), TOAL (Original title), TOFN (original filename), TPUB (publisher), TSSE (encoding settings), TXXX (User defined text), UFID (unique file identifier), USLT (lyrics), WCOM (commercial info), WOAR (artist web page), WXXX (other URL)

As well as these ones that I couldn't identify, so I'll zap 'em and hope nothing breaks:
NCON, TAGC (appears to be a timestamp)

And a couple to manually check later:
TOPE (Original artist - I notice that Kong in Concert uses these for the original track names, though not accurately - they should probably be in TOAL), TYER and TDRL (years with subtly different meanings - taglib does seem to fallback and use these, but I will need to check for conflicts)

So, now that I have a pretty definitive list of tags, it's time to zap 'em (after backing up in case something blows up in my face, of course). Although it's not immediately obvious, it appears that using the --set-text-frame option with the 4 letter name of the frame and no contents will remove it, even if it isn't a text frame. Now, this doesn't appear to actually conserve any space in the file - it shuffles the rest of the tags upwards and zeroes out the gap (presumably conserving the space would be possible, but I don't know an easy way off the top of my head - suggestions welcome). There may be some tags that you want more intelligent processing on - maybe only remove some of the images or only some of the GEOBs - and if that is the case read the eyeD3 documentation, but for me, I'm sick of them all and want them gone:


find music -iname "*.mp3" -exec eyeD3 --set-text-frame=TAGC: --set-text-frame=TPE2: --set-text-frame=TDTG: --set-text-frame=TCOP: --set-text-frame=TDEN: --set-text-frame=TENC: --set-text-frame=TIT1: --set-text-frame=TIT3: --set-text-frame=TLAN: --set-text-frame=TMED: --set-text-frame=TOAL: --set-text-frame=TOFN: --set-text-frame=TPUB: --set-text-frame=TSSE: --set-text-frame=TXXX: --set-text-frame=UFID: --set-text-frame=USLT: --set-text-frame=WCOM: --set-text-frame=WOAR: --set-text-frame=WXXX: --set-text-frame=NCON: --set-text-frame=APIC: --set-text-frame=GEOB: --set-text-frame=PCNT: --set-text-frame=POPM: --set-text-frame=PRIV: --set-text-frame=TCMP: {} \; | tee log
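
Typing out that many options by hand is error prone, so the option list could instead be generated from the frame names. A minimal sketch, where strip_args is a helper name invented for this example and the only eyeD3 option used is the --set-text-frame one from the command above:

```shell
# Turn a list of frame IDs into eyeD3 options that blank each frame,
# e.g. "TPE2 APIC" becomes "--set-text-frame=TPE2: --set-text-frame=APIC: ".
strip_args() {
    for frame in "$@"; do
        printf '%s%s: ' '--set-text-frame=' "$frame"
    done
}
```

The command substitution is left unquoted on purpose so the shell splits it into separate options: find music -iname "*.mp3" -exec eyeD3 $(strip_args TPE2 APIC GEOB) {} \; | tee log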


Depending on how large your collection is, at this stage you may choose to blink, stretch your arms, get some coffee, go to bed or take a vacation. Personally, I wrote a blog post.

I still have some things I know I'll have to fix up - the Deus Ex Soundtracks all seem to have multiple redundant comments, and there are some non-English comment fields, but you should by this stage have a decent understanding of how to do this - that is of course, if this whole article didn't just go over your head (congrats if it did and you still read this far though :)

update: It turns out that the TCMP frame is not actually set by Amarok, so my solution is to remove all the TCMP flags from the library (I've added it to the above list; where the flag is 1 in my collection it is correct, but very few of the other tracks in the same albums are tagged the same way, which would explain some odd behaviour when importing the albums), then to manually add them for all relevant tracks, which hopefully will ease future migration. Unfortunately, as best I can tell, cmus doesn't appear to have any concept of compilation albums in its id3.c. OGG files will supposedly get them since their tags don't require almost one thousand lines of C code to process (by contrast, cmus' vorbis.c file has a mere 285 lines including 33 lines of tag parsing), which raises the question of why only 1 of my OGG compilation albums is marked as such in cmus.

find music/V/Various\ Artists/ -iname "*.mp3" -exec eyeD3 --set-text-frame=TCMP:1 {} \;


update: I've written a simple shell script to do this automatically, just save this as striptags.sh and execute it from your music directory:

#!/bin/sh

oktags="COMM TALB TBPM TCMP TCOM TCON TDRC TIT2 TPE1 TPOS TRCK MCDI TFLT TLEN TDTG"

indexfile=`mktemp`

#Determine tags present:
find . -iname "*.mp3" -exec eyeD3 -v {} \; > "$indexfile"
tagspresent=`sort -u "$indexfile" | awk -F\): '/^<.*$/ {print $1}' | uniq | awk -F\)\> '{print $1}' | awk -F\( '{print $(NF)}' | awk 'BEGIN {ORS=" "} {print $0}'`

rm "$indexfile"

#Determine tags to strip:
tostrip=`echo -n $tagspresent $oktags $oktags | awk 'BEGIN {RS=" "; ORS="\n"} {print $0}' | sort | uniq -u | awk 'BEGIN {ORS=" "} {print $0}'`

#Confirm action:
echo
echo The following tags have been found in the mp3s:
echo $tagspresent
echo These tags are to be stripped:
echo $tostrip
echo The tags will also be converted to ID3 v2.4 where appropriate
echo
echo -n Press enter to confirm, or Ctrl+C to cancel...
read dummy

#Strip 'em
stripstring=`echo $tostrip | awk 'BEGIN {FS="\n"; RS=" "} {print "--set-text-frame=" $1 ": "}'`
find . -iname "*.mp3" -exec eyeD3 --to-v2.4 $stripstring {} \; | tee -a striptags.log
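
In case the "Determine tags to strip" line looks like magic: the keep list is passed in twice so that every tag in it occurs at least twice, which means uniq -u (print only lines that occur exactly once) can never let one through, whether or not it is present in the mp3s. What survives is exactly the tags that are present but not in the keep list. The same idea in isolation, as a sketch with a made-up function name and assuming the list of present tags has no duplicates:

```shell
# Print the IDs that occur in $1 (tags present) but not in $2 (tags to keep).
# Both arguments are space-separated lists; $2 is fed in twice so none of its
# entries can ever be unique. The variables are deliberately unquoted so that
# each ID becomes its own line.
tags_to_strip() {
    printf '%s\n' $1 $2 $2 | sort | uniq -u | tr '\n' ' '
}
```

For example, tags_to_strip "APIC TIT2 TPE2" "TIT2 TALB" leaves APIC and TPE2, while TALB is ignored even though no file uses it.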