Presentation describing the automatic generation of subtitles from a video input.
AUTOMATIC SUBTITLE GENERATOR
By Lohith Kumar Menchu, Manikanta Thumu, Ravinder Putta
Video has become one of the most popular multimedia artefacts used on PCs and the internet. In the majority of videos, the sound holds an important place. For people with hearing impairments, or with gaps in their knowledge of the spoken language, the most natural aid is subtitles.
Therefore, it is necessary to find solutions that make these media artefacts accessible to most people. Here, we confine our research to videos that have a single speaker. If we try to employ SR technology on conversations or meetings where people frequently interrupt each other, we are likely to get extremely poor results.
The current thesis work principally aims to address this problem by presenting a potential system.
Three distinct modules have been defined, namely audio extraction, speech recognition, and subtitle generation (with time synchronization).
The system should take a video file as input and generate a subtitle file as output. The extracted subtitles must also be synchronized with the video content. The speaker-independent model achieves an accuracy greater than 90%, with peaks reaching 98% under optimal conditions (quiet room, high-quality microphone).
In the existing system, whether the media has a single speaker or multiple speakers, the subtitles are generated manually by a linguist. However, manual subtitle creation is a long and tedious activity and requires the constant presence of the user.
Moreover, the user needs to know the language of the video content in order to generate subtitles.
At present, subtitles cannot be generated for all languages, and software that generates subtitles automatically through speech recognition, without the intervention of an individual, is not available.
In the proposed system, SR technology allows a computer to handle sound input from either a microphone or a media file, to be transcribed or used to interact with the machine. The analog signal is converted into digital form and then divided into small segments, which are matched against known phonemes of the appropriate language.
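The digitisation-and-segmentation step can be sketched as follows. This is an illustrative Java fragment, not part of the described system: the class name, the 400-sample frame size (25 ms at an assumed 16 kHz sampling rate), and the energy measure are all assumptions chosen for clarity.

```java
// Sketch: splitting digitised 16-bit PCM speech into short, fixed-size
// frames -- the step that precedes phoneme matching. Frame size and
// sampling rate are illustrative assumptions.
public class FrameSplitter {

    // Split samples into non-overlapping frames of frameSize samples.
    public static short[][] split(short[] samples, int frameSize) {
        int frames = samples.length / frameSize;
        short[][] out = new short[frames][frameSize];
        for (int f = 0; f < frames; f++)
            System.arraycopy(samples, f * frameSize, out[f], 0, frameSize);
        return out;
    }

    // Average energy of one frame; very low energy indicates silence,
    // which is later used to delimit utterances.
    public static double energy(short[] frame) {
        double sum = 0;
        for (short s : frame) sum += (double) s * s;
        return sum / frame.length;
    }
}
```

In a real recogniser the frames would overlap and be converted to spectral features; this sketch only shows the framing idea.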
Speech recognition can be used to handle either a single speaker or an arbitrary number of speakers. The first case, which is our area of interest, presents an accuracy greater than 90%, with peaks reaching 98% under optimal conditions. Various models exist, but modern SR engines are based on Hidden Markov Models (HMMs).
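The core of HMM-based decoding is the Viterbi algorithm: given transition and emission probabilities, it finds the most likely sequence of hidden states for an observed sequence. The following is a minimal sketch with discrete observations (real SR engines decode continuous acoustic features, but the principle is the same); all probabilities in the example are illustrative.

```java
// Minimal Viterbi decoder for a discrete HMM (illustrative sketch).
public class Viterbi {
    // start[s]: initial probability of state s
    // trans[p][s]: probability of moving from state p to state s
    // emit[s][o]: probability of state s emitting observation o
    public static int[] decode(double[] start, double[][] trans,
                               double[][] emit, int[] obs) {
        int n = start.length, t = obs.length;
        double[][] v = new double[t][n];   // best path probability so far
        int[][] back = new int[t][n];      // backpointers for reconstruction
        for (int s = 0; s < n; s++) v[0][s] = start[s] * emit[s][obs[0]];
        for (int i = 1; i < t; i++)
            for (int s = 0; s < n; s++)
                for (int p = 0; p < n; p++) {
                    double prob = v[i - 1][p] * trans[p][s] * emit[s][obs[i]];
                    if (prob > v[i][s]) { v[i][s] = prob; back[i][s] = p; }
                }
        int[] path = new int[t];
        for (int s = 1; s < n; s++)
            if (v[t - 1][s] > v[t - 1][path[t - 1]]) path[t - 1] = s;
        for (int i = t - 1; i > 0; i--) path[i - 1] = back[i][path[i]];
        return path;
    }
}
```

In a recogniser, the hidden states correspond to phonemes (or sub-phoneme states) and the observations to acoustic feature vectors.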
HOW SPEECH RECOGNITION WORKS?
Rule-Based Approach
Early speech recognition systems tried to apply a set of grammatical and syntactical rules to speech.
If the words spoken fit into a certain set of rules, the program could determine what the words were. Accents, dialects, and mannerisms can vastly change the way certain words or phrases are spoken, so this model has limited usage.
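A rule-based recogniser in this sense only accepts utterances that fit a fixed grammar. The sketch below uses a tiny hypothetical grammar (the verbs and objects are invented for illustration) to show why the approach is limited: anything outside the rules, however natural, is simply rejected.

```java
import java.util.Set;

// Sketch of a rule-based check: an utterance is accepted only if it
// matches the fixed grammar <command> ::= <verb> <object>.
// The word lists are illustrative assumptions.
public class RuleBasedRecognizer {
    static final Set<String> VERBS = Set.of("open", "close", "play");
    static final Set<String> OBJECTS = Set.of("door", "window", "video");

    public static boolean accepts(String utterance) {
        String[] w = utterance.toLowerCase().split("\\s+");
        return w.length == 2 && VERBS.contains(w[0]) && OBJECTS.contains(w[1]);
    }
}
```

Note that even the perfectly natural phrase "open the door" is rejected because the grammar has no rule for the article.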
HOW SPEECH RECOGNITION WORKS?
Statistical-Modelling Approach
The statistical approach uses a model with three fundamental components, each modelling a different aspect of the speech signal: the Acoustic Model, the Lexicon, and the Language Model.
Acoustic models require engineers to collect all the sounds made by speakers of a particular language.
We differentiate two acoustic models: speaker-dependent and speaker-independent.
The next part of the model is the lexicon, or dictionary: it defines, for every word in the language, how that word is pronounced.
The third piece is the language model, which describes how words are put together into phrases and sentences in the language.
For example, if the recognizer thinks it has just recognized "the dog" and is trying to figure out the next word, the model may know that "ran" is more likely than "pan" or "can", simply because of what we know about the usage of English.
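The "the dog ran" example corresponds to a bigram language model, which counts how often each word follows another in training text. A minimal sketch (class and method names are illustrative, and the tiny training corpus below is invented):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Sketch of a bigram language model: count word-successor frequencies
// and predict the most likely next word.
public class BigramModel {
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    // Record every adjacent word pair in the sentence.
    public void train(String sentence) {
        String[] w = sentence.toLowerCase().split("\\s+");
        for (int i = 0; i + 1 < w.length; i++)
            counts.computeIfAbsent(w[i], k -> new HashMap<>())
                  .merge(w[i + 1], 1, Integer::sum);
    }

    // Most frequent successor of `word`, or null if unseen.
    public String predictNext(String word) {
        Map<String, Integer> next = counts.get(word.toLowerCase());
        if (next == null) return null;
        return Collections.max(next.entrySet(),
                               Map.Entry.comparingByValue()).getKey();
    }
}
```

Real engines use smoothed n-gram probabilities rather than raw counts, and combine them with the acoustic scores during decoding.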
HOW SPEECH RECOGNITION WORKS?
Our Automatic Subtitle Generator contains three important modules: Audio Extraction, Speech Recognition, and Subtitle Generation.
The audio extraction routine is expected to return a suitable audio format that can be used by the speech recognition module as pertinent material. To facilitate the extraction of audio we use the Java Media Framework (JMF) API, which provides many useful features for dealing with media objects.
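The slide does not show the JMF pipeline itself, so the following is only a sketch of one normalisation step such a module performs: downmixing interleaved 16-bit stereo PCM to mono so the recogniser receives a single channel. The class name is an invented illustration, not JMF code.

```java
// Sketch (not JMF): downmix interleaved 16-bit stereo PCM to mono by
// averaging the left and right samples of each frame.
public class Downmix {
    public static short[] stereoToMono(short[] interleaved) {
        short[] mono = new short[interleaved.length / 2];
        for (int i = 0; i < mono.length; i++)
            mono[i] = (short) ((interleaved[2 * i] + interleaved[2 * i + 1]) / 2);
        return mono;
    }
}
```

A full extraction module would additionally demultiplex the audio track from the video container and resample it to the rate the recogniser expects.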
The speech recognition routine is the key part of the system. Indeed, it directly affects performance and the evaluation of results.
First, it should determine the type of the input file (film, music, news, home-made, etc.) whenever possible. Then, if the type is known, an appropriate processing method is chosen.
The subtitle generation routine aims to create a file containing chunks of text corresponding to utterances delimited by silences, together with their respective start and end times. The module receives a list of words and their speech times from the speech recognition module and produces an SRT subtitle file.
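The SRT format pairs a numbered text chunk with `HH:MM:SS,mmm` start and end timestamps. A minimal sketch of the formatting step (class and method names are illustrative, not the thesis code):

```java
// Sketch: formatting recognised text chunks as SRT entries.
public class SrtWriter {

    // Format a time in milliseconds as an SRT timestamp: HH:MM:SS,mmm
    public static String timestamp(long ms) {
        return String.format("%02d:%02d:%02d,%03d",
                ms / 3600000, (ms / 60000) % 60, (ms / 1000) % 60, ms % 1000);
    }

    // Build one numbered SRT entry from a chunk of text and its times.
    public static String entry(int index, long startMs, long endMs, String text) {
        return index + "\n" + timestamp(startMs) + " --> " + timestamp(endMs)
                + "\n" + text + "\n";
    }
}
```

The generation module would append one such entry per silence-delimited utterance, separated by blank lines, to produce the final `.srt` file.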
In a cyber world where accessibility remains insufficient, it is essential to give each individual the means to understand any media content. In recent years, the internet has seen a proliferation of video-based websites, most of them amateur, for which transcripts are rarely available.
This thesis work was mostly oriented towards video media and suggested a way to produce a transcript of the audio from a video, with the ultimate purpose of making the content accessible to deaf people. Although the current system does not present enough stability to be widely used, it proposes one interesting approach that can certainly be improved.
Tutorial: Getting started with the Java Media Framework. URL http://www.ee.iitm.ac.in/~tgvenky/JMFBook/Tutorial.pdf
How Stuff Works: How speech recognition works. URL http://electronics.howstuffworks.com/gadgets/high-tech-gadgets/speech-recognition.htm
Engineered Station. How speech recognition works. 2001. URL http://project.uet.itgo.com/speech.htm