Abstract: Speech recognition is common in electronic appliances and personal services, but its use for industrial and medical purposes is rare because of the presence of motion ambiguity. For minimally invasive surgical robotic assistants, this ambiguity arises because the robotic motion is not calibrated to the camera images. This paper presents a design for a speech recognition interface for an HIWIN robotic endoscope holder. A new intentional speech control is proposed to control movement over long distances. To decrease ambiguity, a method is proposed for voice-to-motion calibration that compares the degree of change in the endoscope image for a voice command. A speech recognition algorithm is implemented on Ubuntu OS, using CMU Sphinx. The control signal is sent to the robot controller using serial-port communication through an RS232 cable. The experimental results show that the proposed intentional speech control strategy has a navigation precision of up to 3.1° of angular displacement for the endoscope. The overall system processing time, including robotic motion, is 3.22 s for a 1.8-s speech duration. The reference image navigation range is from 2.5 mm for a 0.5-s speech duration up to 6 mm for a 1.8-s speech duration, using a setup with the camera tip located at a distance of 5 cm from the remote center of motion point.

Index Terms: Automated system, human–robot interface, motion control, robotic surgery, speech recognition control.

I. INTRODUCTION

MODERN surgical practice has undergone significant changes since the introduction of minimally invasive surgery (MIS). From the patient's point of view, MIS typically results in a faster recovery rate, smaller scars, less damage to soft tissues, less pain, and a shorter hospital stay. However, from the doctor's side, MIS requires hours of extra special training because of the specificity of the approach. Instead of exposing a patient's organs to open air and direct manual intervention, surgeons perform operations through small (10 mm) incisions using specialized long-range instruments and a camera view. For abdominal MIS, these incisions are arranged in a circle, and a laparoscope camera is inserted at the center [1]. At least one skilled assistant helps the surgeon during the operation by holding and moving the laparoscope, while the surgeon manipulates the laparoscopic instruments using both hands [2]. This task is as straightforward as it is tiring for the human assistant: an endoscope is heavy, and camera trembling and mishandling can occur after the first few hours of an operation. Routine tasks, such as camera handling, can be passed to a robotic holder.

In the past, robots have been widely used for industrial purposes in manufacturing. Modern production robots outperform humans, in terms of precision, speed, and throughput, in the performance of the specific tasks for which they are designed. In MIS, no specific motion algorithm can be used for all operations. Surgeon-to-robot interaction is an inevitable step. Fully robotized surgical systems, such as the state-of-the-art da Vinci system [3] from Intuitive Surgical, handle both the endoscope and tool manipulators. More compact systems typically handle only the endoscope. Robots use numerous control strategies that transform robotic holders from mere pieces of equipment into intelligent assistants [4]–[6]. Li et al. demonstrated attention-aware laparoscope navigation that uses eye tracking [7]. Nishikawa et al. proposed FAce MOUSe [8], whereby the facial expression of the surgeon is interpreted as a control signal for the camera holder. While all of these methods are feasible, the particular focus here is on speech recognition, which is the most natural of the various surgeon-to-robot communication methods [9].

The tools for natural speech recognition have developed quickly with the emergence of powerful computing machines. Many studies show examples of the successful use of speech control for mobile robots [10], humanoid robots [11], and aerial robots [12]. Speech control has been well received by users (mobile and personal computer (PC) applications) and scientific researchers, but its industrial applications are few. Recently, industrial robots have been equipped with a human-to-machine interaction strategy that enables human–robot collaboration and a more efficient use of manpower. Pires [13] showed the possible benefits of this type of human–robot cooperation in the workplace, with improvements to robotic versatility and production throughput. Even with these benefits, at its current stage of development, speech control is not entirely accepted by industry [14]. Rogowski [15] described problems that industry-oriented voice control systems face and the requirements for their successful integration into an industry. The acceptance of robots in

Manuscript received January 20, 2016; revised April 22, 2016, June 17, 2016, August 30, 2016, and October 7, 2016; accepted October 24, 2016. Date of publication November 7, 2016; date of current version April 18, 2017. This work was supported in part by the Ministry of Science and Technology under Grant MOST 103-2221-E-009-185. Paper no. TII-16-0058. (Corresponding author: K.-T. Song.)
K. Zinchenko is with the National Chiao Tung University, Hsinchu 30010, Taiwan (e-mail: zinchenko.kateryna@gmail.com).
C.-Y. Wu is with the Industry 4.0 Division, Fair Friend Group, Taipei 300, Taiwan (e-mail: chien800614@hotmail.com).
K.-T. Song is with the Institute of Electrical Control Engineering, National Chiao Tung University, Hsinchu 30010, Taiwan (e-mail: ktsong@mail.nctu.edu.tw).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TII.2016.2625818
1551-3203 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
608 IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, VOL. 13, NO. 2, APRIL 2017
TABLE I
HIWIN MTG-H100 KEYBOARD CONTROLS

unit, which handles the writing and reading of data from the HIWIN robot's data transmission port. Data that are forwarded to the port must be properly encoded and data received from the port must be correctly decoded. When the data have been sent, the unit awaits the robot's response. The synchronization of the system depends on the communication speed of the RS232 link; new user commands are not processed until there is a response from the robot.
point detection and is responsible for gathering, annotating, and processing the input data. The front end also extracts features from the input data that are read by a decoder. The annotations that are provided by the front end include the beginning and the end of a data segment. At the input stage, the system continuously receives and records the signal from the headset. This approach allows the system to estimate the average noise level inside an operating room and to adjust the threshold value at which signal processing begins. Determining the end of the command speech is crucial in voice recognition applications. A signal that is greater than the threshold value indicates the start point of speech, and a significant decrease in the signal amplitude indicates the end point. If the threshold is not properly calibrated, the system might misinterpret background noise as a control command.

The knowledge base consists of three components: a dictionary, a language model, and an acoustic model. The knowledge base provides information that the decoder needs to process signals. This information includes the acoustic model and the language model. The feedback from the decoder allows the knowledge base to dynamically improve itself, using successful search results. However, this feedback is not required for restricted-vocabulary systems.

The SRU uses the Hidden Markov Model (HMM) recognition engine to train specific speech patterns and to extract words from an acoustic signal and present them in the form of Mel Frequency Cepstrum Coefficients. Filtering and sampling of the acoustic signal precedes feature extraction. A sampling frequency of 16 kHz is used for this speech recognition system.

The decoder block, which comprises a hypothesis search and a state probability computation, performs the principal part of the search. It selects the next set of likely states, scores incoming features against these states, dismisses the low-scoring states, and finally generates results by selecting the most probable hypothesis. The result is presented in the form of a text string that is chosen from the set of allowed robotic commands. The voice control system uses the CMU Sphinx open-source project that was developed by Carnegie Mellon University [22]. CMU Sphinx is provided in the form of software development tools and libraries.

C. Recognition of Intentional Speech

One of the drawbacks of using voice control is that it is inappropriate for long-distance motion. To move an endoscopic camera a significant distance, the same command may have to be given repeatedly until the destination is reached. The natural human reaction is to prolong words, so instead of "left," the command might be "leeeeft," to indicate a longer motion. These two words sound the same to the human ear, but a computer registers a significant difference. The sounds of language are classified into what are termed phonemes. A phoneme is a minimal unit of sound that has a semantic content; e.g., the phoneme "AE" versus the phoneme "EH" defines the difference between the sounds "a" and "e" in the words "bat" and "bet." In terms of speech decoding, the main difference between the two terms, "left" and "leeeeft," is in the length of the input region that is sampled during voice recording and the repeated phoneme "EH."

An additional set of robot motion commands is required for ISR to produce a longer movement. Matching the execution time with the period of the pronounced speech, which gives real voice control, is the optimal solution to navigation over longer distances using speech recognition. This function would further decrease the ambiguity that can arise when there is uncertainty about the duration of robotic movement for each control word, because the surgeon defines the period using pronunciation. Different forms of this methodology are present in all manual controllers: while there is an input signal, the robot moves. The proposed system may be able to extract the necessary information by interpreting the intention of the speech command.

Fig. 4 shows the plot for the HMM as it searches for the most suitable phoneme sequence that is related to the input word. The process of speech recognition determines the best possible sequence of words that fit the given input speech. This can be simplified to the matching of a sequence of phonemes to a specific word. Instead of sampling and decoding the entire voice signal, the input signal is sampled in terms of short utterances and phonemes are detected in real time (see Fig. 5). After sampling, the utterance is processed by the front end and the decoder and, finally, the phonemes are detected. A sequence of phonemes is the output.

At the next block, these phonemes are searched for the presence of a phoneme that starts a command. For example, the first received phoneme is "l," which corresponds to the speech command "left." If the next phoneme corresponds to the second phoneme of the word "left" (namely, "EH"), the system sends the corresponding movement command and continues to search new phonemes for the word ending. Repeated vowels do not affect the search. In the opposite case, if phonemes do not match, a stop command is sent to the controller. Therefore, a robot's move signal is sent when the second phoneme matches the corresponding control word's phoneme. The robot's stop signal is transmitted when the last phoneme is decoded, if a phoneme does not match the control word, or if the recognition system reaches the end of the speech (silence is detected). Four control words are used for ISR: right, left, front, and back. Each word contains only one vowel that can be extended during pronunciation: "i," "e," "o," and "a." During the experiments using short and long pronunciation of these words, the vowel sounds that are pronounced when the words are short are different from the vowel sounds that are decoded when the words are prolonged. Namely, the vowel "e" in the first case is represented only by the "EH" phoneme, but in the second case, it is represented by a set of "EH" and "IY" phonemes. Similarly, the phoneme "a" is represented by a set of "AE" and "AH" phonemes, "o" is represented by a set of "AH" and "OW" phonemes, and "i" is represented by an "AY" phoneme. This change occurs because vowels are pronounced differently in the presence or absence of a consonant. This transformation of the phoneme means that the addition of alternative representations for vowel phonemes is a crucial factor in the development of the system. In terms of the design of the system, each of the vowels has two possible representations. During the experiments, the pronunciation of consonants remained constant, so they have only one form of representation.
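The phoneme-prefix matching that distinguishes "left" from "leeeeft" can be sketched as a small state machine. The following is an illustrative reconstruction rather than the authors' code; the function names are assumptions, and the phoneme sets follow the four control words and the vowel alternatives described in the text.

```python
# Illustrative sketch of the ISR phoneme-prefix matcher: the robot is
# commanded to MOVE as soon as the second phoneme of a control word is
# verified, keeps moving while a prolonged vowel repeats, and is commanded
# to STOP on the final phoneme, on a mismatch, or on silence.

# Allowed phonemes per position; vowels have two representations
# (short vs. prolonged pronunciation), as described in the text.
COMMANDS = {
    "left":  [{"L"}, {"EH", "IY"}, {"F"}, {"T"}],
    "right": [{"R"}, {"AY"}, {"T"}],
    "front": [{"F"}, {"R"}, {"AH", "OW"}, {"N"}, {"T"}],
    "back":  [{"B"}, {"AE", "AH"}, {"K"}],
}

def run_isr(phonemes):
    """Consume a phoneme stream and return the (action, word) events that
    would be sent to the robot controller."""
    events = []
    word, pos, moving = None, 0, False
    for ph in phonemes:
        if word is None:
            # Look for a phoneme that starts one of the control words.
            for cmd, seq in COMMANDS.items():
                if ph in seq[0]:
                    word, pos = cmd, 1
                    break
            continue
        seq = COMMANDS[word]
        if ph in seq[pos]:
            if pos == 1 and not moving:
                events.append(("MOVE", word))   # second phoneme verified
                moving = True
            if pos == len(seq) - 1:
                events.append(("STOP", word))   # last phoneme decoded
                word, pos, moving = None, 0, False
            else:
                pos += 1
        elif ph in seq[pos - 1]:
            pass  # repeated (prolonged) phoneme: robot keeps moving
        else:
            if moving:                          # mismatch or silence: stop
                events.append(("STOP", word))
            word, pos, moving = None, 0, False
    if moving:
        events.append(("STOP", word))           # stream ended while moving
    return events
```

Running `run_isr(["L", "EH", "EH", "EH", "F", "T"])` yields one MOVE followed by one STOP for "left," with the robot's travel time spanning the prolonged vowel.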
Fig. 4. HMM for phonemes of the words "left" and "leeeft." If a dictionary contains words that can be distinguished by their very first phoneme, it is possible to identify the control word immediately, generate the appropriate robot command, and use the consecutive phonemes for confirmation only. At the same time, the duration of the remaining phonemes can indicate the duration of the robot's motion, allowing a wider robot movement range.

Fig. 5. ISR block diagram. To be able to dispatch the robot command while a person speaks, it is necessary to recognize the word from the first or second letter. Thus, phonemes are detected and mapped against the phonemes that compose the words in the dictionary.
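Before the phoneme matching shown in Fig. 5 can begin, the front end has to segment the utterance. A minimal sketch of the adaptive energy-threshold endpointing described earlier follows; the frame format, margin factor, and function names are assumptions, not the authors' implementation.

```python
# Minimal sketch of adaptive energy-threshold endpointing: the system
# continuously records audio, estimates the average room-noise level, and
# marks speech start when the signal rises above a threshold and speech end
# when it drops back below it.

def frame_energy(frame):
    """Mean squared amplitude of one audio frame."""
    return sum(s * s for s in frame) / len(frame)

def endpoint(frames, noise_frames, margin=4.0):
    """Return (start, end) frame indices of the detected utterance, or None.

    noise_frames: leading frames assumed to contain only room noise; they
    set the adaptive threshold, so a noisier room raises the bar.
    """
    noise = sum(frame_energy(f) for f in noise_frames) / len(noise_frames)
    threshold = margin * noise
    start = end = None
    for i, frame in enumerate(frames):
        energy = frame_energy(frame)
        if start is None and energy > threshold:
            start = i                 # signal exceeds threshold: speech begins
        elif start is not None and energy < threshold:
            end = i                   # significant amplitude drop: speech ends
            break
    if start is None:
        return None                   # nothing rose above the noise floor
    return (start, end if end is not None else len(frames))
```

A poorly calibrated `margin` reproduces the failure mode mentioned in the text: background noise crosses the threshold and is passed on as a command.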
and written to the port. The robot's response is read from the port, decoded, and checked for validity. The robot outputs different strings for MOVE and STOP commands. If a command is not executed properly, another control message is written to the port until the robot produces the desired output. When the command has been executed successfully, the serial port is ready to receive a new command from the speech command controller. The PC-to-robot communication uses the principle send-confirm, receive-send-confirm. A write to the port follows the reading and analysis of the robot's output message. In this study, the PySerial open-source library is used for the serial control block. PySerial is written in the Python programming language.
E. Development Environment

The voice control system is implemented on a PC with an Intel i7 processor, running Ubuntu 14.04 LTS and Python 2.7, using the PyCharm development environment. The CMU Sphinx open-source library is used for the speech recognition algorithm, and the open-source PySerial library is used for RS232 communication programming. The entire system is written in Python.
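The send-confirm exchange used for PC-to-robot communication can be sketched as follows. The confirmation strings are placeholders (the proprietary HIWIN protocol strings are not given in the text), and the port argument is any object with write()/readline(), so that the logic can be shown without hardware; in the running system it would be a PySerial handle.

```python
# Sketch of the PC-to-robot send-confirm loop. In the real system the port
# would come from PySerial, e.g. serial.Serial('/dev/ttyUSB0'); device name
# and settings are assumptions. Command and response strings are placeholders
# for the proprietary HIWIN protocol.

VALID_MOVE = {"right", "left", "front", "back", "in", "out"}
VALID_STOP = {"stop"}

def send_command(port, command, max_retries=3):
    """Validate a decoded speech command, write it to the port, and re-send
    it until the robot confirms execution. Returns True on confirmation."""
    if command in VALID_MOVE:
        expected = b"MOVE_OK"           # placeholder confirmation string
    elif command in VALID_STOP:
        expected = b"STOP_OK"           # placeholder confirmation string
    else:
        return False                    # not a known robot command
    for _ in range(max_retries):
        port.write(command.encode("ascii") + b"\n")  # encode and send
        response = port.readline().strip()           # block on robot reply
        if response == expected:
            return True                 # confirmed: ready for next command
    return False                        # give up after repeated failures
```

Because `send_command` blocks on `readline`, new user commands are not processed until the robot responds, which matches the synchronization behavior described above.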
Fig. 6. Block diagram of the RS232 communication unit, which is responsible for the command consistency check and command flow control.
D. Serial-Port Communication

The PC is connected to the HIWIN MTG1000 robotic arm using a serial-connection protocol. The serial-connection architecture is shown in Fig. 6. First, the received decoded speech command is checked for validity against the set of available robot commands. The robot commands are divided into two groups: MOVE (right, left, front, back, in, and out) and STOP (stop). After validation, the control string is converted to bytes

III. SPEECH-TO-MOTION CALIBRATION

During MIS, one assistant usually holds the endoscope and moves it in accordance with the surgeon's commands. During natural speech interaction, it is common to use ambiguous commands, such as "a little to the right" or "move more to the left." The assistant must understand the distance that the endoscope must be moved to achieve the desired view. Since camera video is the only source of information about the situation inside a patient, a surgeon's commands relate to the observed image, rather than to the change in the endoscope's position.
TABLE II

Speech     Endoscope angular displacement    Corresponding calibration image
Command    precision (deg)                   displacement (mm)
           Pedal Control    Voice Control    Pedal Control    Voice Control
Right      12.3             7.7              4                2.5
Left       12.3             7.7              4                2.5
Back       3.9              3.1              2.3              1.8
Forward    3.9              3.1              2.3              1.8

TABLE III

Speech     Utterance Duration
Command    0.5 s     0.8 s     1.8 s
Right      1.58      1.62      3.34
Left       1.19      1.86      3.44
Back       1.55      1.69      3.22
Front      1.19      1.67      3.34
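As a rough plausibility check (a back-of-the-envelope calculation, not from the paper), the image displacement of a target at working distance r from the camera tip can be approximated by the arc length d ≈ r·θ for an angular step θ:

```python
import math

# Small-angle arc-length estimate: a target at the given working distance
# (the paper's setup places it 5 cm from the endoscope tip) moves roughly
# distance * angle when the endoscope rotates. This ignores the remote-
# center-of-motion geometry and camera projection, so it only approximates
# the measured values in Table II.

def image_displacement_mm(angle_deg, distance_mm=50.0):
    return distance_mm * math.radians(angle_deg)

estimate = image_displacement_mm(3.1)
```

The 3.1° voice-control step at 5 cm gives an estimate of about 2.7 mm, the same order of magnitude as the 1.8-mm calibration-image displacement reported in Table II; the gap reflects the geometry effects ignored here.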
IV. EXPERIMENTAL RESULTS

The SRU lies at the heart of the proposed system, so it is crucial that it is reliable. The first experiment determined the success rate for the SRU that is shown in Fig. 3. A person commanded the robot to move, using a headset, and the response from the SRU was recorded. Command strings were classified as a "success" or a "fail." Every command was repeated ten times by the same person. The average recognition rate was 90%, with a standard deviation of 5%. The processing time for the SRU was 0.27 s once the sampled utterance had been input to the system. A series of tests then determined the robot's navigation precision using the SRU, including a comparative test using the foot pedal. Two experiments were then performed using a separate ISR unit. The test dataset included 120 voice commands: ten for each command (right, left, front, and back), with each word uttered for a different length of time (0.5, 0.8, and 1.8 s). In the first experiment, the ISR latency was calculated with respect to the different lengths of the commands. In the second experiment, the change in the object on the image plane was measured for speech commands of different lengths. The system is designed to move the endoscope farther if the command is longer.

A. Results for System Navigation Precision

Table II shows the calibration measurement results for the proposed robotic system. In the experiment, the best navigation precision that the proposed system achieves is 3.1° for the forward and backward tasks. It is seen that there is a difference in the precision of the navigation system for horizontal and vertical movement commands. This is due to the motor arrangement for the endoscope holder and its range of movement. The range of horizontal movement is 110° and the vertical range is 60°. A larger range of movement requires more significant moves per step.

The displacement of the calibration image shows the change in an object on the image plane that is located at a distance of 5 cm from the endoscope tip when the speech command is executed. For example, if an apple is being observed and a command for a right movement is executed, only half of the apple might be visible when the movement is completed. In the case of a human heart, this limited view might be a problem during an operation. The distances in Table II show that after the right movement, the apple moves only 2 mm to the right (this is the smallest distance that the current system can achieve, and different values can be used). Therefore, a surgeon can change the image view without suddenly losing half of the apple. The change in the calibration image depends on the angular displacement of the endoscope, which is typically greater than the range that is required for surgical tasks, so the calibration image must be adjusted accordingly.

B. Pedal Navigation Experiment Results

To verify the voice-controlled system, the precision experiment was repeated using the foot pedal controller that was provided with the robotic system. To ensure an accurate comparison of the distances, the equipment and the calibration image that were used were the same as those used for the system's navigation test. The joystick pedal was depressed in the desired direction until the robot performed the motion and then immediately released. The images that were acquired are shown in Fig. 7(b). The relative distance that was traveled from the starting point was then calculated. To determine the precision in degrees and the displacement of the calibration image for each single command, the methodology for the first experiment was used. The results of this test are summarized in Table II. It is seen that the overall system performance is consistent with expectations for a voice-controlled robotic assistant. There is no significant difference between the precision of navigation using speech control and pedal control, which demonstrates that voice-controlled systems perform sufficiently well for practical use.

C. ISR Latency and Precision

A practical system must execute commands within a reasonable time. This experiment calculated the latency of the ISR system. For latency measurements, the overall execution time was measured. The processing time includes both phoneme decoding and robot motion execution time. Each of the four commands was spoken ten times and three different lengths of command were used: 0.5, 0.8, and 1.8 s. This experiment quantitatively determined the maximum command transmission speed, using RS232 serial-port communication and the intentional SRU that is implemented using CMU Sphinx. The controller for the robotic endoscope holder is proprietary to HIWIN and its response speed cannot be altered. Table III shows the measured processing time for each command word. The robot starts to move immediately after the first and second phonemes are verified and stops when the last phoneme is processed. The relative navigation distance depends on the length of the pronounced command word, as shown in Fig. 9. For the shortest length of
VI. CONCLUSION
Of the different robotic control strategies, voice control is the
most convenient because it does not require the use of limbs.
Speech recognition liberates the assistant who is responsible
for handling an endoscope. This study develops a method for
the voice control of an endoscope holder, in terms of the intuitiveness of the interface from the surgeon's point of view. The
primary goal of the proposed methodology is to show the neces-
sity and usefulness of motion calibration in terms of the relative
change in the image for a voice command. The intuitiveness of
the system depends on the ISR engine, which maps the length
of the control word to the distance that the endoscope camera
navigates.
[18] P. Berkelman, E. Boidard, P. Cinquin, and J. Troccaz, "LER: The light endoscope robot," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Las Vegas, NV, USA, Oct. 2003, pp. 2835–2840.
[19] ViKY Uterine Positioner. (2016). [Online]. Available: http://www.endocontrol-medical.com/press_release.php
[20] A. A. Gumbs, F. Crovari, C. Vidal, P. Henri, and B. Gayet, "Modified robotic lightweight endoscope (ViKY) validation in vivo in a porcine model," Surg. Innov., vol. 14, pp. 261–264, 2007.
[21] A. Perrakis et al., "Integrated operation systems and voice recognition in minimally invasive surgery: Comparison of two systems," Surg. Endosc., vol. 27, no. 2, pp. 575–579, 2013.
[22] CMU Sphinx. (2016). [Online]. Available: http://cmusphinx.sourceforge.net/
[23] PySerial. (2016). [Online]. Available: http://pyserial.sourceforge.net/

Kateryna Zinchenko (S'16) was born in Kyiv, Ukraine, in 1991. She received the B.S. degree in electronics engineering from National Chiao Tung University, Hsinchu, Taiwan, in 2013, where she is currently working toward the Ph.D. degree in electrical engineering and computer science.
Her research interests include surgical robots, swarm robotics, shared control, artificial intelligence, and VR.
Ms. Zinchenko received the Best Student Paper Nomination (CACS 2014, Taiwan), the System and Architecture Talent Training Program Award (MOE 2014, Taiwan), and the Best Student Paper Award (ICCAS 2015, Korea).

Chien-Yu Wu received the M.S. degree in electrical control engineering from National Chiao Tung University, Hsinchu, Taiwan, in 2015.
He had an internship with the Institute of Mechatronic Systems, Leibniz University, Hannover, Germany. His research interests include surgical continuum robotics and speech recognition.

Kai-Tai Song (A'91–M'09) received the B.S. degree in power mechanical engineering from National Tsing Hua University, Hsinchu, Taiwan, in 1979, and the Ph.D. degree in mechanical engineering from the Katholieke Universiteit Leuven, Leuven, Belgium, in 1989.
Since 1989, he has been with the faculty and is currently a Professor with the Institute of Electrical Control Engineering, National Chiao Tung University, Hsinchu. His current research interests include mobile robots, image processing, visual tracking, and mobile manipulation.
Dr. Song received the Excellent Automatic Control Engineering Award of the Chinese Automatic Control Society (CACS) and the Engineering Paper Award of the Chinese Institute of Engineers. He received the best paper award of the IEEE ICSSE 2016, the IEEE ICAL 2012, and the CACS 2013 and 2014, respectively. He coached the NCTU Robotics team and won first place in the University Challenge at the World Robot Olympiad, Qatar, in 2015. He is a Fellow of CACS and currently serves as the President of CACS, Taiwan.