Mr. Norda, in the project "Voice Controlled Production" at Fraunhofer IDMT in Oldenburg, you are developing intuitive voice control for machines, which is intended to simplify work in the production process. Can you explain how it works?
Marvin Norda: Our speech recognition, with all its features, can recognize speech in different environments and derive information from it. We therefore don't focus on a specific area of industrial production; where the speech recognition is ultimately deployed is secondary.
The system creates a text file based on phonemes, which can then be read by any control system, regardless of whether it is a robot, a lathe, an automatic production machine or an automated guided vehicle.
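Because the recognizer's output is plain text, any downstream controller can consume it. A minimal sketch of that idea is shown below; the command names and the phrase-to-command mapping are purely illustrative assumptions, not Fraunhofer IDMT's actual interface.

```python
# Illustrative mapping from recognized phrases to controller commands.
# The phrases and (device, action) pairs are hypothetical examples.
COMMANDS = {
    "start spindle": ("SPINDLE", "ON"),
    "stop spindle": ("SPINDLE", "OFF"),
    "open gripper": ("GRIPPER", "OPEN"),
}

def parse_transcript(text: str):
    """Turn one recognized utterance into a (device, action) command.

    Returns None if the phrase is not in the known vocabulary, which a
    real controller would treat as "ignore this utterance".
    """
    phrase = text.strip().lower()
    return COMMANDS.get(phrase)

# Example: one line of recognizer output, as it might appear in the text file
cmd = parse_transcript("Start Spindle")
print(cmd)  # ('SPINDLE', 'ON')
```

The same text file could just as well be parsed by a PLC or a robot controller; the point is that the interface is format-agnostic text rather than a vendor-specific protocol.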
How large is the hardware needed for the speech recognition?
Norda: In principle, we focus on the needs of our customers. On the one hand, there is the option of secure online access to our speech recognition technology: the customer can run it on their own server or access our Fraunhofer server. On the other hand, the voice control or voice documentation can run offline on a minicomputer, for example directly at the machine. So we are very flexible. We have deliberately designed our speech recognition to run on small hardware; connecting a large component is not necessary. However, especially in the production sector, very few people want to control their machines via a small computer, even if it works well and reliably. A Raspberry Pi, for example, is not a certified control unit for the production area.
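The two deployment modes described here could be dispatched by a thin client like the sketch below. The function names, the placeholder transcripts, and the server URL are all assumptions for illustration; a real remote client would stream audio over a secure connection.

```python
# Hypothetical sketch of the two deployment modes: process audio locally
# (offline, e.g. on a minicomputer at the machine) or send it to a server.

def recognize_local(audio: bytes) -> str:
    # Placeholder for an on-device recognizer running next to the machine.
    return "start spindle"

def recognize_remote(audio: bytes, url: str) -> str:
    # Placeholder for a secure upload to a recognition server; a real
    # client would POST the audio over TLS and return the transcript.
    return "start spindle"

def recognize(audio: bytes, offline: bool = True) -> str:
    """Dispatch to local or remote recognition depending on deployment."""
    if offline:
        return recognize_local(audio)
    return recognize_remote(audio, "https://example-recognizer.invalid/api")

# Either mode yields the same plain-text transcript for the controller.
print(recognize(b"\x00" * 1600, offline=True))
```

The design choice matters for certification and latency, not for the interface: downstream, the controller sees the same text either way.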
That's why many companies use a programmable logic controller (PLC), which is established in the production environment and well known to testing organizations. However, a rethink is under way: small micro PCs are gradually finding their way into the industrial sector and will therefore sooner or later also be certified for such applications.
Many users of voice assistants such as Google Assistant or Alexa repeatedly experience situations in which what was spoken is not recognized, sometimes because of unclear pronunciation, sometimes because a term is unknown. How clearly does the user need to speak with your voice control?
Norda: We have a standard vocabulary that we train with training material, and that is sufficient. Unlike a few years ago, we no longer have to train the system for each individual speaker; the training data is enough to recognize different speakers.
Even if a dialect is spoken, for example?
Norda: It's a question of the desired accuracy. Of course, you can teach the software dialects, but that is usually not necessary. For example, someone who grew up in Bavaria (Germany's far southern state) will understand another Bavarian better than an East Frisian (from East Frisia, a region in northern Germany), and an East Frisian will understand another East Frisian better than a Bavarian. But that doesn't mean a Bavarian and an East Frisian can't understand each other. It's similar with speech recognition: the more training material I provide in a given direction, the better it works. And just as an East Frisian may misunderstand things in Bavaria, the software gets something wrong when it has not been trained for it.