CJ, I know that voice lipsyncing is an unsupported feature of AGS. But would it be possible to get some documentation on how AGS interprets the Pamela format? I haven't been able to track down any official documentation of the format, and looking through the source code didn't help me. I'm messing about a bit with Visual C++ to see how hard it would be to output .pam files from the automatic lipsync source.
I assume that since you can customize your own phonemes, that part of the conversion isn't a problem. As long as the phonemes put out by the automatic lipsync match those in AGS' lipsync table everything should be fine, right?
The timings puzzle me, however. I thought they were measured in frames (Pamela's fps, not AGS loops), but the numbers seem far too high. Nor are the timings in sequence, so they can't be times measured from the beginning of the wave file. Are they additive (i.e. the first phoneme "1215:S" runs from 0 to 1215, and the second, "765:IH1", from 1215 to 1215+765=1980)? If so, how is silence handled? I see no gaps in my .pam files.
Edit: Playing around a bit, I discovered that the timings are indeed measured from the beginning of the wave file; they're just listed totally out of order in the .pam file. Does this (lack of) order have any special significance? I still can't work out what the timings are measured in, though. Changing the framespersecond value doesn't seem to alter the timings, and framesperphoneme seems only to be used internally by Pamela when breaking a text up into phonemes.
While playing around with Pamela, I also found out something about the lack of pauses. It seems that each phoneme frame is held until the next one starts, and that the default phoneme set of Pamela phonemes doesn't include silence (closed mouth frame). But if we were to associate a frame in AGS with a symbol to indicate pause (the lipsync software uses "), and used the same symbol in the .pam file, it should work, I believe. Can you please confirm this?
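To illustrate what I mean, a .pam file with an explicit closed-mouth entry might look something like this (the " pause symbol, the timing values, and the exact section header are my assumptions, pieced together from the entries I've seen, not anything official):

```
[Speech]
1215:S
1980:IH1
2700:"
```

Here the " entry would hold the closed-mouth frame from 2700 onwards, instead of the IH1 frame being held until the next phoneme.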
Also, just to make sure: is it possible for AGS to interpret timing values that aren't multiples of 15 (which seems to be the default in Pamela and not very precise)? And does AGS parse the framespersecond and framesperphoneme values at all?
Edit 2: Sorry about the constant updates, it's just that I keep playing around with the programs. After adding a phoneme to a distinctive section of a voice clip in Pamela and locating the same part in Audacity, I worked out that Pamela values are the timing in seconds multiplied by 360, for some unknown reason.
So the basic conversion would now be:
*Retrieve phoneme start value from automatic lipsync in milliseconds
*Multiply this value by 0.360 (i.e. the timing in seconds times 360)
*Retrieve the phoneme as-is. If we set the lipsync in AGS up properly we don't need any conversion
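The steps above can be sketched in a few lines of Python (keeping in mind that the 0.360 factor is my own measurement from comparing clips in Audacity, not a documented constant):

```python
# Sketch of the conversion described above: milliseconds from the
# automatic lipsync tool -> Pamela timing units. The 0.360 factor
# (seconds * 360) is my own guess, not anything official.

def ms_to_pam(milliseconds):
    """Convert a phoneme start time in milliseconds to a Pamela timing value."""
    return round(milliseconds * 0.360)

def to_pam_line(milliseconds, phoneme):
    """Format one entry the way they appear in a .pam file, e.g. '1215:S'."""
    return "%d:%s" % (ms_to_pam(milliseconds), phoneme)

print(to_pam_line(3375, "S"))  # 3375 ms * 0.360 = 1215, so prints "1215:S"
```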
The above is a very simple process which could even be accomplished in AGS script. However, for the lipsync process to be manageable for the user, we'd still need:
*Batch processing of a whole audio folder (*.wav doesn't currently work as a parameter)
*Decoding of .mp3 and .ogg files to a temporary .wav before lipsync (could be done with external program before syncing, but would be nice to have integrated)
*Output of (modified as specified above) lipsync data to files named after the source audio files with the extension .pam (currently the program outputs data to a console)
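The batch step could be sketched like this in Python (the lipsync tool's name and calling convention here are placeholders for illustration, since I don't know what its command line will end up looking like):

```python
# Hypothetical batch wrapper: run the automatic lipsync tool on every
# .wav in a folder and save the converted output next to it as a .pam
# file. "lipsync_tool" and its arguments are made up for illustration.
import os
import subprocess

def pam_path(wav_path):
    """Map a source .wav filename to the matching .pam filename."""
    root, _ext = os.path.splitext(wav_path)
    return root + ".pam"

def batch_lipsync(folder):
    for name in sorted(os.listdir(folder)):
        if not name.lower().endswith(".wav"):
            continue
        wav = os.path.join(folder, name)
        # Placeholder invocation: capture the console output the current
        # program produces and write it to a .pam file instead.
        result = subprocess.run(["lipsync_tool", wav],
                                capture_output=True, text=True)
        with open(pam_path(wav), "w") as f:
            f.write(result.stdout)
```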
As mentioned above, I downloaded Visual C++ Express to try to change the source code, but so far I haven't even figured out how to import it! The project converts without any errors reported, but I'm then told it can't be opened in this version of Visual Studio. Perhaps I should try another compiler.
