Suggestion (not just for CJ): Automatic lipsync

Started by GarageGothic, Mon 12/05/2008 00:49:15


GarageGothic

While surfing around, I came across this open-source lipsync software. It analyzes a voice clip (.wav, .ogg, .mp3*) and automatically assigns phonemes to it. It then writes the data in a fully documented file format, but I assume the source code could be changed to output data in Pamela and other formats. (To see it in action, check out the demo for Lipsync Tool, a commercial implementation of the technology - I tried it on a few old voice samples and was quite impressed.)

With the growing number of voice-acted AGS games, lipsync support is becoming a standard feature of AGS. Unfortunately the only supported lipsync program, Pamela, is crash-prone and slow to work with. Everything has to be done manually, and the text of each voice line must be pasted in. The Al Emmo team spent a month or more just on the lipsyncing. It's probably wise that Dave Gilbert hasn't tried to use it on his games, or we would still be waiting for Blackwell Legacy.

However, with this source code lipsyncing would become a batch process with little-to-no work from the developer.

I don't really think this needs to be integrated into the AGS engine; it could work fine as a stand-alone tool (which would also prevent any license issues for the finished game). Unless CJ sees any need to add further lipsync format support, it would have to be customized to output Pamela files, though. Unfortunately I don't have the programming skills to do any of this, but I thought it might be a good idea to make you aware of this technology, and maybe hear what people think about its possible use with AGS.

Edit: I see now that it requires Microsoft's SAPI SDK to be installed, but I don't think that would be a problem for developers dedicated enough to recruit voice actors.

*Edit 2: It seems that, unlike the Lipsync Tool, the source code only supports .wav files. I don't know how much work it would be to integrate .ogg and .mp3 decoding before running the sync.

Edit 3: Made some further tests. I ran the compiled binary from the source code distribution on a couple of wave files (purely automated, textless sync) and then played the output back in Lipsync Tool. They were Blackwell Convergence samples of Joey and Rosangela that Dave posted on his forums, and both voices synced up great. I'm also quite impressed by how many of the words the speech recognition identified correctly in the text output (not that the exact words as written are all that important for the sync).

Rui 'Trovatore' Pires

All of this interests me hugely and immensely. I have no programming skills outside AGS (which is "scripting", not "programming", anyway), so I can't help... but this sounds extraordinarily useful, if possible.

GarageGothic

#2
I uploaded a short test clip. Click here to view directly in your browser. Please don't sue me for copyright infringement, Dave :). I should also add that I recorded some lines in my own language, Danish, and the lipsync works just as well with non-English voice samples.

Edit: Fixed the video format and uploaded to streaming server so you don't have to download anything. Note that the lipsyncing was done entirely with the open-source tools. The commercial Lipsync Tool software is only used for playback in the example.

Pumaman

Impressive. Yeah, I think this is something better suited to a standalone utility, or editor plugin, if anyone feels so inclined.

GarageGothic

#4
CJ, I know that voice lipsyncing is an unsupported feature of AGS. But would it be possible to get some documentation on how AGS interprets the Pamela format? I haven't been able to track down any official documentation of the format, and looking through the source code didn't help me. I'm messing about a bit with Visual C++ to see how hard it would be to output .pam files from the automatic lipsync source.

I assume that since you can customize your own phonemes, that part of the conversion isn't a problem. As long as the phonemes put out by the automatic lipsync match those in AGS' lipsync table everything should be fine, right?

The timings puzzle me, however. I thought they were measured in frames (Pamela's fps, not AGS loops), but the numbers seem far too high. The timings aren't in sequence either, so they can't be times measured from the beginning of the wave file. Are they additive (i.e. first phoneme "1215:S" runs from 0 to 1215, second phoneme "765:IH1" runs from 1215 to 1215+765=1980)? If so, how is silence handled? I don't see any gaps in my .pam files.

Edit: Playing around a bit, I discovered that the timings are indeed measured from the beginning of the wave file; they're just listed totally out of order in the .pam file. Does this (lack of) order have any special significance? I still can't work out what the timings are measured in, though. Changing the framespersecond doesn't seem to alter the timings, and framesperphoneme seems only to be used internally in Pamela when breaking up the phonemes from a text.
While playing around with Pamela, I also found out something about the lack of pauses. It seems that each phoneme frame is held until the next one starts, and that Pamela's default phoneme set doesn't include silence (a closed-mouth frame). But if we were to associate a frame in AGS with a symbol indicating a pause (the lipsync software uses "), and used the same symbol in the .pam file, it should work, I believe. Can you please confirm this?

Also, just to make sure, is it possible for AGS to interpret timing values that aren't multiples of 15 (which seems to be the default in Pamela and isn't very precise)? And does AGS parse the framespersecond and framesperphoneme values at all?

Edit 2: Sorry about the constant updates, it's just that I keep playing around with the programs. After adding a phoneme to a distinctive section of a voice clip in Pamela and locating the same part in Audacity, I worked out that Pamela values are the timing in seconds multiplied by 360, for some unknown reason.

So the basic conversion would now be:

*Retrieve the phoneme start value from the automatic lipsync output, in milliseconds
*Multiply this value by 0.360
*Use the phoneme itself as-is; if we set up the lip sync table in AGS properly, we don't need any conversion
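In code terms the middle steps are trivial - here's a rough C++ sketch (the function name is mine, and the "timing:PHONEME" entry format is simply what I've seen in my own .pam files):

#include <cmath>
#include <sstream>
#include <string>

// Build one Pamela-style entry (e.g. "1215:S") from the automatic lipsync
// output: start time in milliseconds times 0.36, phoneme passed through as-is.
std::string toPamEntry(long startMs, const std::string& phoneme)
{
    long pamelaUnits = std::lround(startMs * 0.36);
    std::ostringstream entry;
    entry << pamelaUnits << ":" << phoneme;
    return entry.str();
}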

The above is a very simple process which could even be accomplished in AGS script. However, for the lipsync process to be manageable for the user, we'd still need the following (a rough sketch of the batch part follows the list):

*Batch processing of a whole audio folder (*.wav doesn't currently work as a parameter)
*Decoding of .mp3 and .ogg files to a temporary .wav before the sync (could be done with an external program beforehand, but would be nice to have integrated)
*Output of the lipsync data (converted as described above) to files named after the source audio files, with the extension .pam (currently the program just prints to the console)
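Something along these lines could drive the batch part (rough C++17 sketch; runLipsync() and writePamFile() are made-up placeholders for wrapping the Annosoft code and writing the converted data - neither exists yet):

#include <filesystem>
#include <iostream>

namespace fs = std::filesystem;

// Sketch of the batch loop only. The commented-out calls stand in for
// invoking the Annosoft analysis and writing out the converted .pam data.
void processSpeechFolder(const fs::path& speechFolder)
{
    for (const auto& entry : fs::directory_iterator(speechFolder))
    {
        if (entry.path().extension() != ".wav")
            continue;  // .mp3/.ogg would first need decoding to a temporary .wav

        fs::path pamFile = entry.path();
        pamFile.replace_extension(".pam");

        // auto phonemes = runLipsync(entry.path());  // hypothetical wrapper
        // writePamFile(pamFile, phonemes);           // hypothetical writer
        std::cout << entry.path().filename() << " -> " << pamFile.filename() << "\n";
    }
}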

As mentioned above, I downloaded Visual C++ Express to try to change the source code. But so far I haven't even figured out how to import it!  :P The project converts without any errors reported, but I'm then told it can't be opened in this version of Visual Studio. Perhaps I should try another compiler.

AGD2

I can certainly vouch for the fact that it takes a very, very long time to lipsynch lines for an AGS game. A faster and easier way to do this would be most welcome!

I had previously looked into these Annosoft programs, but they seemed very pricey. I never knew that they also offered a free version of their source code. Nice find!

With automated lipsync, you're probably not going to get results as accurate as you would from manually syncing the lines yourself. Background noise and bad-quality recordings can result in phantom phonemes being added during an automated process. But, of course, you could always go back and tweak them in Pamela later on (if they're indeed going to be output to .pam format).

That Annosoft lipsync program also has tons of phonemes (a lot of the transitional ones) that are entirely unnecessary for a 2D AGS game. I've done a fair bit of experimenting with getting decent-looking results in AGS with a minimum number of portrait frames, and you really only need 8 phoneme frames in AGS to have convincing lipsync animations. I use the following Pamela phonemes to represent the visual mouth frames, and would suggest that all of the redundant phonemes in Annosoft's program revert to the most relevant of the 8 frames below (or, if you want to use more or fewer phonemes, this aspect could even be made entirely tweakable):

ZH = (Mouth closed frame)
AY0 = Mouth in A position. Used for the letters:  A, I, U
W = Mouth in W position. Used for the letters: Q, W
EH0 = Mouth in E position. Used for the letters: C, E, G, H, K, R, Y
S = Mouth in S position. Used for the letters: CH, D, J, N, S, SH, T, X, Z
F = Mouth in F position. Used for the letters: F, V
L = Mouth in L position. Used for the letter: L
AO0 = Mouth in O position. Used for the letter: O
B = Mouth in M position. Used for the letters: B, M, P

This source code looks promising, though. Hopefully somebody will have luck turning it into something that can be used to simplify this time-consuming process!

GarageGothic

Thanks for your input, AGD2. Great to hear from someone experienced with lipsyncing!

Of course an automated process will never be as accurate as manual lipsyncing. But considering the time saved, I'm quite happy with my tests so far. Even low-quality samples (a low-bitrate clip sampled from an old VHS tape, and a voice recorded with a cheap headset) seemed to process quite well. As you say, it's always possible to tweak the file in Pamela, and it probably will be necessary for non-vocal sounds such as a character coughing.

Your phoneme list will also be very useful. I just have to figure out whether it makes more sense to force simplified phonemes during the file format conversion or just let the developer set it up himself within AGS. I guess the latter solution would be more flexible, though you're right that the total number of phonemes output by Annosoft's code is overkill for non-3D games.

Regarding my attempts with the source code, it turns out that the source depends on Microsoft's ATL classes which are not included with the free Visual C++ Express. Is there anyone out there with the full Visual C++/Visual Studio version who would like to give it a try?

Pumaman

Quote from: GarageGothic on Tue 13/05/2008 03:29:28
CJ, I know that voice lipsyncing is an unsupported feature of AGS. But would it be possible to get some documentation on how AGS interprets the Pamela format?

Ok, I've uploaded an extract from the AGS editor source code. This is the function that compiles the PAM files, so that you can see what AGS does with them:
http://www.adventuregamestudio.co.uk/temp/pamreader.txt

Basically, it just reads each phoneme and converts the timing, and that's about it.
Currently it uses a hardcoded assumption of how to translate the Pamela timings into milliseconds.

Quote from: GarageGothic
Regarding my attempts with the source code, it turns out that the source depends on Microsoft's ATL classes which are not included with the free Visual C++ Express. Is there anyone out there with the full Visual C++/Visual Studio version who would like to give it a try?

I don't think compiling the pamela source code would be particularly useful for this -- all you'd need to do would be to write a separate application that could convert output files from Annosoft into .PAM files that AGS can read.

GarageGothic

#8
Thanks for the code, CJ! It's nice to know exactly which parts of the file AGS interprets and which it ignores. It's a bit ironic that AGS internally converts the data to a format much closer to Annosoft's (in milliseconds and with end timings). For a moment the timing calculation confused me, but it all adds up to the 0.360 I had discovered: dividing a number by (1000/15)/24 = 2.777... is the same as multiplying it by 0.36. Going the other way, the "1215:S" entry from my earlier example works out to 1215 x 2.777... ≈ 3375ms, i.e. about 3.4 seconds into the clip. That is only true if the fps setting hasn't been changed in Pamela, but since AGS uses hardcoded values it's not a problem.

Quote from: Pumaman on Tue 13/05/2008 19:41:51
Quote from: GarageGothic
Regarding my attempts with the source code, it turns out that the source depends on Microsoft's ATL classes which are not included with the free Visual C++ Express. Is there anyone out there with the full Visual C++/Visual Studio version who would like to give it a try?

I don't think compiling the pamela source code would be particularly useful for this -- all you'd need to do would be to write a separate application that could convert output files from Annosoft into .PAM files that AGS can read.

Ah, no I meant for compiling the modified Annosoft code. I'm not touching the Pamela source at all, only using it for reference.


smiley

#9
I've made a test editor plugin that converts the output of the Annosoft program to Pamela's format:
http://ueberlicht.googlepages.com/AGS.Plugin.generatepam.dll
(only wav files atm; sapi_lipsync.exe has to be in the editor folder)

I think I'll add that to the audio manager plugin...

SSH

12

GarageGothic

#11
Excellent! Do you plan to add .ogg and .mp3 support? It would definitely be great to be able to batch process speech files from the Audio Manager menu.

I guess this means that I can cancel my, ahem, acquiring of the 3GB+ Visual Studio :)

AGD2

Wow, very nice work, Smiley!  I've tested it out briefly and it's pretty impressive. At the moment it doesn't work when loading the generated .pam files into Pamela, on account of some of the letters being lowercase and others not having a number after them. I'll post some more about this tomorrow.  (Pamela only recognizes upper case letters, although AGS probably isn't as picky.)

Oh, one thing that would be handy is to have an option to offset all the phonemes to play a little earlier. The reason being that when speech is lip synced on-the-fly, the program has to process the sound first and then generate the letters. But in real life, people visibly move their lips into position before the vocalizations are produced.  So having the ability to offset the phonemes in that manner would really help it look more natural.
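Implementation-wise, it would just be something like this (rough C++ sketch; the struct and function names are made up, not from the plugin or Annosoft):

#include <algorithm>
#include <string>
#include <vector>

// Made-up structure for illustration only.
struct Phoneme
{
    long startMs;        // start time in milliseconds
    std::string symbol;  // e.g. "S", "AY0"
};

// Shift every phoneme a little earlier so the mouth moves into position
// slightly before the sound is heard, clamping at zero for the first ones.
void offsetPhonemes(std::vector<Phoneme>& phonemes, long offsetMs)
{
    for (Phoneme& p : phonemes)
        p.startMs = std::max(0L, p.startMs - offsetMs);
}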

Great work though! I'll post more info on the complete phoneme list soon.

smiley

Quote from: GarageGothic
Do you plan to add .ogg and .mp3 support?
Yes. Probably by converting them back to wav...

Quote from: AGD2
At the moment it doesn't work when loading the generated .pam files into Pamela, on account of some of the letters being lowercase and others not having a number after them.
And I didn't include the "Preferences" section in the .pam file. ;)


I definitely want to add an editor for .pam files which also shows a preview of the speech animation.

GarageGothic

#14
Quote from: smiley
Yes. Probably by converting them back to wav...

Yeah, you could just decode them to a temporary file, use it as a source for Annosoft and then delete it. Since your AudioManager already supports .mp3 and .ogg playback, the lack of a wav wouldn't be an issue in your own phoneme editor.

Quote from: smiley
I definitely want to add an editor for .pam files which also shows a preview of the speech animation.

That would be absolutely awesome! Especially with Pamela being so buggy, only working with .wav files and having a far from smooth preview playback. Being able to import, automatically lipsync and then tweak the lipsync animation, all within the editor would make the process so much smoother.

This whole thing has completely changed my view on adding speech to my game. Now it's beginning to seem manageable, despite the huge job of recording and processing the voice clips.


AGD2

#15
Quote from: Smiley
And I didn't include the "Preferences" section in the .pam file. ;)

I think AGS can get by without the "Preferences" section in the .pam file, but Pamela needs it to locate the directory of the .wav file. Maybe the directory could be written to the .pam file based on where you opened the .wav file from.

Quote from: Smiley
I definitely want to add an editor for .pam files which also shows a preview of the speech animation.

That would be an excellent addition. Would this editor be only for preview purposes? Or would it also allow you to move phonemes around at will and change them to other values like Pamela's editor?



Anyhow, here's the full phoneme list from Pamela. It shows which of the 8 selected "AGS" phonemes I have each Pamela phoneme revert to. Perhaps this could be considered for use as the 'default' setting, but you could also allow users to tweak, change, add and delete phonemes as they see fit.

Note that the table below encompasses ALL existing Pamela phonemes. This is exactly how I set them up in AGS's "Lip sync" section:

0:  ZH/None
1: AY0/AY1/AY2/AA0/AA1/AA2/AH0/AH1/AH2/AE0/AE1/AE2
2: W/OW0/OW1/OW2/OY0/OY1/OY2/UW0/UW1/UW2
3: EH0/EH1/EH2/CH/ER0/ER1/ER2/EY0/EY1/EY2/G/K/R/Y/HH
4: S/Z/IH0/IH1/IH2/IY0/IY1/IY2/SH/T/TH/D/DH/JH/N/NG
5: F/V
6: L
7: AO0/AO1/AO2/AW0/AW1/AW2/UH0/UH1/UH2
8: B/M/P

Those graphical frames in order are:

0: Mouth Closed
1: A frame
2: W frame
3: E frame
4: S frame
5: F frame
6: L frame
7: O frame
8: B frame
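For whoever ends up writing the converter: baked into code, that default mapping is just a lookup table. A rough C++ sketch (only the 0-stress variants listed for brevity; anything not listed falls back to the mouth-closed frame, as I do with 'None'):

#include <map>
#include <string>

// Default Pamela-phoneme -> AGS frame mapping described above (excerpt;
// the 1/2-stress variants map to the same frames as their 0-stress forms).
const std::map<std::string, int> kDefaultFrameFor = {
    {"ZH", 0}, {"None", 0},
    {"AY0", 1}, {"AA0", 1}, {"AH0", 1}, {"AE0", 1},
    {"W", 2}, {"OW0", 2}, {"OY0", 2}, {"UW0", 2},
    {"EH0", 3}, {"CH", 3}, {"ER0", 3}, {"EY0", 3}, {"G", 3}, {"K", 3}, {"R", 3}, {"Y", 3}, {"HH", 3},
    {"S", 4}, {"Z", 4}, {"IH0", 4}, {"IY0", 4}, {"SH", 4}, {"T", 4}, {"TH", 4}, {"D", 4}, {"DH", 4}, {"JH", 4}, {"N", 4}, {"NG", 4},
    {"F", 5}, {"V", 5},
    {"L", 6},
    {"AO0", 7}, {"AW0", 7}, {"UH0", 7},
    {"B", 8}, {"M", 8}, {"P", 8}
};

int frameForPhoneme(const std::string& phoneme)
{
    auto it = kDefaultFrameFor.find(phoneme);
    return (it != kDefaultFrameFor.end()) ? it->second : 0;  // 0 = mouth closed
}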

And here's an example of a dialogue portrait with those 8 frames in order:



Hope this helps.

GarageGothic

Quote from: AGD2 on Fri 16/05/2008 06:24:26
I think AGS can get by without the "Preferences" section in the .pam file, but Pamela needs it to locate the directory of the .wav file. Maybe the directory could be written to the .pam file based on where you opened the .wav file from.

If you add phoneme editing to the plugin, it would make sense to read the .wav (or, hopefully, .mp3 or .ogg) directly from the speech folder. Especially as people tend to move their files around when upgrading to a new AGS version. But for those wanting to use that god-awful Pamela tool, legacy support is a good idea.

How did you come up with that phoneme list? Trial and error? It seems to differ just slightly from the Preston Blair mouth shapes (see here for original, here for an alternate version, and here for 3D version). I think it's definitely important to set up a good default scheme, since that's what a lot of people will end up using. Perhaps even with a default animation to model your own artwork on and to test lipsyncing for characters who don't yet have speech animations. Would it make sense to have two default phoneme setups, simple phonemes (for pixel art) and extended phonemes (for hi-res or pre-rendered art)?

AGD2

#17
Quote from: GarageGothic
How did you come up with that phoneme list? Trial and error? It seems to differ just slightly from the Preston Blair mouth shapes (see here for original, here for an alternate version, and here for 3D version).

Yes, this was completely trial and error. I didn't base the visual mouth shapes strictly off the Pamela sample ones, nor off any other set. I just figured out the absolute minimum number of frames that could be used to cover all phonemes and still look convincing. I merged some of the vowel phonemes: rather than having individual phonemes for each of the 5 vowels, 3 phonemes now cover them all (A and U were merged; E and I were merged). The above list was the final result that I came up with. For Al Emmo, there were actually 9 phonemes in use (an additional T phoneme), but I decided that it could be dropped to bring it down to only 8. The resulting difference in the mouth animation is virtually undetectable, as S covers T very well.

Keep in mind that this is strictly from a Pamela/AGS perspective and doesn't take into account which Annosoft phonemes currently get assigned to each Pamela Phoneme in the conversion process. You would probably need to first compile a full list of all the Annosoft phonemes, compare them visually, and figure out which mouth frames look similar. Then you'd need to group all of those Annosoft phonemes into the  A, W, E, S, F, L, O, B, and 'Mouth Closed' categories, so that Annosoft and Pamela are both working with the same "AGS" phoneme set (if that makes sense.)

Quote from: GarageGothic
Would it make sense to have two default phoneme setups, simple phonemes (for pixel art) and extended phonemes (for hi-res or pre-rendered art)?

To be honest, I don't think it'd be necessary to have a separate set of phonemes for high-res and pixel-art portraits. The mouth frames tend to move so quickly that you don't really notice how many phonemes there are. Looking at the Graham one I posted above, you'd probably be hard pressed to tell just from casual viewing that there are 8 phonemes involved.

I guess this is a good argument as to why it'd be ideal to allow people to add more phonemes if they think they'll need them for higher res pre-rendered portraits. After all, no sense in limiting people to one standard.  But personally speaking, I don't think having two defaults would make a great deal of visual difference. People tend to not look at the lip movements closely after a while either.

GarageGothic

#18
From a quick comparison it seems that Annosoft use pretty much the same phonemes as Pamela, though without the numbers (not sure what they signify - stresses?). Here's the phoneme list from Microsoft SAPI 5.1, which Annosoft uses:

Quote
SYM Example PhoneID
- syllable boundary (hyphen) 1
! Sentence terminator (exclamation mark) 2
& word boundary 3
, Sentence terminator (comma) 4
. Sentence terminator (period) 5
? Sentence terminator (question mark) 6
_ Silence (underscore) 7
1 Primary stress 8
2 Secondary stress 9
aa father 10
ae cat 11
ah cut 12
ao dog 13
aw foul 14
ax ago 15
ay bite 16
b big 17
ch chin 18
d dig 19
dh then 20
eh pet 21
er fur 22
ey ate 23
f fork 24
g gut 25
h help 26
ih fill 27
iy feel 28
jh joy 29
k cut 30
l lid 31
m mat 32
n no 33
ng sing 34
ow go 35
oy toy 36
p put 37
r red 38
s sit 39
sh she 40
t talk 41
th thin 42
uh book 43
uw too 44
v vat 45
w with 46
y yard 47
z zap 48
zh pleasure 49

So conversion shouldn't be necessary, except for the few compatibility issues already mentioned - capital letters and the number after the phoneme. If your phoneme list doesn't distinguish between AY0, AY1, AY2 and so on, it should be safe to just add 0 to the phonemes needing a number. I haven't seen the stress values used in any of my Annosoft scripts so far, and I assume they're just not part of the automatic lipsync output - perhaps they're used if you also input a source text?
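So, per phoneme, the fix-up would be something like this (rough C++ sketch; which phonemes need the stress digit appended is taken from the two lists above):

#include <cctype>
#include <string>

// Make a SAPI/Annosoft phoneme acceptable to Pamela: upper-case it and,
// for the vowel phonemes Pamela writes with a stress digit, append "0"
// (we don't distinguish stresses anyway).
std::string toPamelaPhoneme(std::string phoneme)
{
    for (char& c : phoneme)
        c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));

    static const char* needsStress[] = {
        "AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER", "EY",
        "IH", "IY", "OW", "OY", "UH", "UW"
    };
    for (const char* v : needsStress)
        if (phoneme == v)
            return phoneme + "0";

    return phoneme;
}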

AGD2

#19
Looks pretty similar, yeah. And yes, the numbers after Pamela phonemes represent stresses. Pamela was originally designed to assist with lip syncing for another program called "Magpie", which makes use of stresses. However, since AGS simply holds the phoneme in place until the next one supersedes it, stresses aren't needed in AGS .pam files and all of them can simply have 0 at the end.


Quote from: GarageGothic
...perhaps they're used if you also input a source text?

Speaking of which, I don't suppose that feature is available in the free source code that Annosoft offer? I know that one of their commercial SDKs allows you to open a wav file AND also type the line's text in order to have both methods work together to calculate the best phoneme placement along the waveform. This method seemed more accurate than plain on-the-fly lip syncing in Annosoft's program.


--EDIT--

Forgot to mention that the ZH phoneme is one that I hijacked for the "mouth closed" frame. In both Pamela and Annosoft it's used for a 'Z' sound, but since 'S' covers that (and Pamela doesn't have a default "mouth closed" frame), I reserved ZH for the "mouth closed" frame instead.

Note that Pamela also has a 'None' phoneme that displays if you forget to assign a phoneme letter to the bar. In this case, I also made unassigned 'None' phonemes revert to the "mouth closed" frame.  Just some things to keep in mind when doing the Annosoft>Pamela phoneme conversion.
