

A Study on Speech Recognition Control for a Surgical Robot
Kateryna Zinchenko, Student Member, IEEE, Chien-Yu Wu, and Kai-Tai Song, Member, IEEE

Abstract—Speech recognition is common in electronic appliances and personal services, but its use for industrial and medical purposes is rare because of the presence of motion ambiguity. For minimally invasive surgical robotic assistants, this ambiguity arises because the robotic motion is not calibrated to the camera images. This paper presents a design for a speech recognition interface for an HIWIN robotic endoscope holder. A new intentional speech control is proposed to control movement over long distances. To decrease ambiguity, a method is proposed for voice-to-motion calibration that compares the degree of change in the endoscope image for a voice command. A speech recognition algorithm is implemented on Ubuntu OS, using CMU Sphinx. The control signal is sent to the robot controller using serial-port communication through an RS232 cable. The experimental results show that the proposed intentional speech control strategy has a navigation precision of up to 3.1° of angular displacement for the endoscope. The overall system processing time, including robotic motion, is 3.22 s for a 1.8-s speech duration. The reference image navigation range is from 2.5 mm for a 0.5-s speech duration up to 6 mm for a 1.8-s speech duration, using a setup in which the camera tip is located at a distance of 5 cm from the remote center of motion point.

Index Terms—Automated system, human–robot interface, motion control, robotic surgery, speech recognition control.

Manuscript received January 20, 2016; revised April 22, 2016, June 17, 2016, August 30, 2016, and October 7, 2016; accepted October 24, 2016. Date of publication November 7, 2016; date of current version April 18, 2017. This work was supported in part by the Ministry of Science and Technology under Grant MOST 103-2221-E-009-185. Paper no. TII-16-0058. (Corresponding author: K.-T. Song.)

K. Zinchenko is with the National Chiao Tung University, Hsinchu 30010, Taiwan (e-mail: zinchenko.kateryna@gmail.com).

C.-Y. Wu is with the Industry 4.0 Division, Fair Friend Group, Taipei 300, Taiwan (e-mail: chien800614@hotmail.com).

K.-T. Song is with the Institute of Electrical Control Engineering, National Chiao Tung University, Hsinchu 30010, Taiwan (e-mail: ktsong@mail.nctu.edu.tw).

Digital Object Identifier 10.1109/TII.2016.2625818

I. INTRODUCTION

MODERN surgical practice has undergone significant changes since the introduction of minimally invasive surgery (MIS). From the patient's point of view, MIS typically results in a faster recovery rate, smaller scars, less damage to soft tissues, less pain, and a shorter hospital stay. However, from the doctor's side, MIS requires hours of extra special training because of the specificity of the approach. Instead of exposing a patient's organs to open air and direct manual intervention, surgeons perform operations through small (10 mm) incisions using specialized long-range instruments and a camera view. For abdominal MIS, these incisions are arranged in a circle, and a laparoscope camera is inserted at the center [1]. At least one skilled assistant helps the surgeon during the operation by holding and moving the laparoscope, while the surgeon manipulates the laparoscopic instruments using both hands [2]. This task is as straightforward as it is tiring for the human assistant: an endoscope is heavy, and camera trembling and mishandling can occur after the first few hours of an operation. Routine tasks, such as camera handling, can be passed to a robotic holder.

In the past, robots have been widely used for industrial purposes in manufacturing. Modern production robots outperform humans, in terms of precision, speed, and throughput, in the performance of the specific tasks for which they are designed. In MIS, no specific motion algorithm can be used for all operations, so surgeon-to-robot interaction is an inevitable step. Fully robotized surgical systems, such as the state-of-the-art da Vinci system [3] from Intuitive Surgical, handle both the endoscope and the tool manipulators. More compact systems typically handle only the endoscope. Robots use numerous control strategies that transform robotic holders from mere pieces of equipment into intelligent assistants [4]–[6]. Li et al. demonstrated attention-aware laparoscope navigation that uses eye tracking [7]. Nishikawa et al. proposed FAce MOUSe [8], whereby the facial expression of the surgeon is interpreted as a control signal for the camera holder. While all of these methods are feasible, the particular focus here is on speech recognition, which is the most natural of the various surgeon-to-robot communication methods [9].

The tools for natural speech recognition have developed quickly with the emergence of powerful computing machines. Many studies show examples of the successful use of speech control for mobile robots [10], humanoid robots [11], and aerial robots [12]. Speech control has been well received by users (mobile and personal computer (PC) applications) and scientific researchers, but its industrial applications are few. Recently, industrial robots have been equipped with human-to-machine interaction strategies that enable human–robot collaboration and a more efficient use of manpower. Pires [13] showed the possible benefits of this type of human–robot cooperation in the workplace, with improvements to robotic versatility and production throughput. Even with these benefits, at its current stage of development, speech control is not entirely accepted by industry [14]. Rogowski [15] described the problems that industry-oriented voice control systems face and the requirements for their successful integration into an industry.


The acceptance of robots in healthcare is similar to the situation in industry. Surgery is a process of human repair, but there is little knowledge of the components' specification.
The integration of voice control and robot-assisted MIS began with Computer Motion's introduction of the AESOP robot [16], which was approved by the FDA in 1994. The AESOP robot contains a laparoscopic camera that is fixed to the robotic arm, near the operation table. The robotic arm is controlled using a joystick or by voice commands. At the time, the major drawbacks of this system were the latency of the speech recognition engine and the low recognition rate. KaLAR [17] is an example of another compact surgical assistant that is equipped with a seven-command voice control system and instrument tracking. In contrast to AESOP, KaLAR is fixed directly to the surgical table. In 2003, Berkelman et al. introduced the LER [18], a light endoscope robot, which resulted in the creation of the ViKY Uterine Positioner [19]. This robot is positioned directly on the patient's abdomen. The ViKY voice-control interface is activated by keywords that are followed by commands. In a comparative study [20], the ViKY voice recognition system achieved a success rate of 71%, compared to the AESOP system, which had a success rate of 67%. Given that ViKY is the newest commercially available MIS assistant that uses voice control as the primary command input for endoscope positioning, there is room for improvement in speech control for surgical applications.
The problems that voice control systems face in industry are similar to those that occur in the medical field. Perrakis et al. compared two existing integrated operation systems that were developed to centralize the control of all components within the operating room: the Siemens Integrated OR system and the Karl Storz OR1 [21]. Misunderstanding of commands is one of the most common system faults, and it causes irritation to surgeons. Another important factor is the ambiguity of the robotic motion for each speech command. There is no standard for this parameter, so the specifications for commercially available robotic assistants take no account of this factor. Speech control is often thought to be vague because the user cannot access the motive range of the robot for each command, so the machine is deemed to be inferior. In terms of controller design, one speech command can correspond to a movement of 5 mm or 5 cm. The absence of flexibility in the length of any motion results in intermittent and nonintuitive navigation. The commands must be repeated if the camera is repositioned to a farther viewpoint.

The benefits of voice control in MIS applications, in terms of handling the camera holder, are important for the development of a medical-robot interface because this allows robots to be integrated into surgical teams. It is necessary to address any ambiguity in the robotic holder's response to speech commands and to devise a method to calibrate voice commands for a range of robotic motion relative to an object of interest in an endoscope image. A practical system must also allow real-time use and be highly capable of decoding speech. This paper proposes a design for an intentional speech recognition (ISR) interface to control a 3-DOF HIWIN robotic endoscope holder during MIS, using serial-port communication for PC-to-robot command delivery.

Fig. 1. The complete system consists of a headset, the SRU, a PC-to-robot communication unit, and the robotic endoscope holder.

The remainder of this paper is organized as follows: Section II presents the proposed system's architecture and describes the main components, such as the speech recognition unit (SRU) that uses Sphinx [22] and the serial connection that uses PySerial [23]. Section III describes the design of the voice-control calibration method and the experimental setup. Section IV presents the experimental results. Section V discusses and summarizes the contributions of this paper.

II. ARCHITECTURE OF THE PROPOSED SYSTEM

During MIS, images from the camera are the only source of information about the state of a patient's internal organs and the position of the surgical instruments. A robotic endoscope holder must safely and precisely handle a camera according to a surgeon's commands. Doctors evaluate successful navigation as a function of the camera, in terms of whether the region of interest is in the camera's view.

The proposed system consists of a speech controller and a HIWIN robotic arm. The speech controller has two main functions: speech recognition and serial-port communication between the host computer and the robotic controller. A block diagram that gives an overview of the system's components is shown in Fig. 1. A voice command is received from the surgeon's headset. The signal from the headset is then transmitted to the SRU in the computer, where the voice is transformed into a text string. When the command line is generated, it is sent to the RS232 connection
unit, which handles the writing and reading of data from the HIWIN robot's data transmission port. Data that are forwarded to the port must be properly encoded, and data received from the port must be correctly decoded. When the data have been sent, the unit awaits the robot's response. The synchronization of the system depends on the communication speed of the RS232; new user commands are not processed until there is a response from the robot.

TABLE I
HIWIN MTG-H100 KEYBOARD CONTROLS

Keyboard Character   Robot's Response
d                    Camera moves right, robot's head moves left.
a                    Camera moves left, robot's head moves right.
w                    Camera moves up, robot's head moves down.
s                    Camera moves down, robot's head moves up.
i                    Camera moves toward the organ, robot's head stays still.
e                    Camera moves away from the organ, robot's head is still.
n                    Any movement stops execution.

Fig. 2. The HIWIN robotic endoscope holder can be controlled using a foot pedal controller or keyboard input from a PC. The primary control block accepts the input control signal and interprets it for the motor controllers.
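For illustration, a recognized command string can be mapped onto the keyboard-character protocol of Table I before it is written to the serial port. The following is a minimal Python sketch under stated assumptions: the assignment of the spoken words front/back to the 'w'/'s' keys and the function name are illustrative choices, not HIWIN's specification.

```python
# Keyboard-character protocol of Table I; mapping "front"/"back" onto
# 'w'/'s' is an assumption made for illustration.
COMMAND_TO_KEY = {
    'right': 'd',   # camera moves right, robot's head moves left
    'left':  'a',   # camera moves left, robot's head moves right
    'front': 'w',   # camera moves up, robot's head moves down
    'back':  's',   # camera moves down, robot's head moves up
    'in':    'i',   # camera moves toward the organ
    'out':   'e',   # camera moves away from the organ
    'stop':  'n',   # any movement stops execution
}

def command_to_keystroke(command):
    """Translate a decoded speech command into the robot's keyboard
    character; unknown strings fall back to the stop key as a safe default."""
    return COMMAND_TO_KEY.get(command, 'n')
```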

A. HIWIN Robotic Endoscope Holder MTG-H100


The robot that is used in this study is a surgical robot designed by HIWIN. It comprises a remote center of motion (RCM) module that has 3 DOF (two rotational DOF and one translational DOF) and an articulated automatic balancing system with 4 DOF (three rotational DOF and one translational DOF) (see Fig. 2). The automatic balancing system is manually operated, using gravity compensation and autobalance features. It has passive drivers and a working range of 360° × 200 mm (H) × 480 mm (R). The RCM module repositions the gripper to which the endoscope is attached. The permitted working range is 110° in the left-right directions and 0° to 60° in the back-forward directions, and the endoscope's insertion range is 200 mm. The maximum load for the robot gripper is 2 kg, which is sufficient to hold an endoscope. All robotic motion is defined with respect to the endoscope insertion point, which is the embedded location of the RCM point. When a command is received, the primary control block automatically makes the necessary inverse kinematics calculations for the joint commands for the three motors. A pedal controller that is provided by HIWIN is used to send control inputs to the robot. The controller consists of two pedals and a joystick. The pedals control the injection and ejection of the endoscope, and the joystick controls the endoscope's movements to the left, right, forward, and backward. For safety reasons, the middle of the controller has a laser sensor; the robot executes the foot controller's commands only if the laser is covered by the operator's foot. The foot pedal strictly controls the amount of motion that is transmitted to the robot: the robot only moves if the joystick or the pedal is operated, and it stops when there is no input or when the laser is not covered. Therefore, the operator can select continuous or discrete robotic motion using the foot pedal. The robot's activity can also be controlled by the host computer (PC), using a USB-RS232 connector. PuTTY is used in serial mode to make the connection with the robotic controller. In terms of input, the robot accepts keyboard character commands, WASD + EIN. Table I shows a complete list of keyboard commands and the robot's responses to them. The terms left, right, etc., describe the robot's motion from the point of view of the operating surgeon. The command "camera moves right" causes the scene in the camera's viewfinder to move right. When a control command is received, it is continuously executed by the robot until the controller receives a stop command. If a stop command is not received and the robot reaches the limit of its movement, it continues to send information about its state but does not move farther. All of these commands are proprietary to HIWIN and cannot be altered. The limit of the robot's range is defined using sensors on the right and left sides of the robot's head. The proposed system substitutes a MISUMI endoscope camera for a medical endoscope. The camera is equipped with an LED light and has a flexible body and a diameter of 0.9 mm.

Fig. 3. The SRU consists of three main parts. The speech controller is a communication block between the SRU and the RS232 block.

B. Design of the SRU

A speech recognition system must capture a speech signal, process the signal, extract features, and perform speech recognition. A block diagram of the overall SRU system is shown in Fig. 3.

The front end controls feature extraction and endpoint detection and is responsible for gathering, annotating, and processing the input data. The front end also extracts features from the input data that are read by a decoder. The annotations that are provided by the front end include the beginning and the end of a data segment. At the input stage, the system continuously receives and records the signal from the headset. This approach allows the system to estimate the average noise level inside an operating room and to adjust the threshold value at which signal processing begins. Determining the end of the command speech is crucial in voice recognition applications. A signal that is greater than the threshold value indicates the start point of speech, and a significant decrease in the signal amplitude indicates the end point. If the threshold is not properly calibrated, the system might mistake background noise for a control command.

The knowledge base consists of three components: a dictionary, a language model, and an acoustic model. The knowledge base provides the information that the decoder needs to process signals. This information includes the acoustic model and the language model. Feedback from the decoder allows the knowledge base to dynamically improve itself, using successful search results. However, this feedback is not required for restricted-vocabulary systems.

The SRU uses the hidden Markov model (HMM) recognition engine to train specific speech patterns and to extract words from an acoustic signal and present them in the form of Mel-frequency cepstrum coefficients. Filtering and sampling of the acoustic signal precede feature extraction. A sampling frequency of 16 kHz is used for this speech recognition system.

The decoder block, which comprises a hypothesis search and a state probability computation, performs the principal part of the search. It selects the next set of likely states, scores incoming features against these states, dismisses the low-scoring states, and finally generates results by selecting the most probable hypothesis. The result is presented in the form of a text string that is chosen from the set of allowed robotic commands. The voice control system uses the CMU Sphinx open-source project that was developed by Carnegie Mellon University [22]. CMU Sphinx is provided in the form of software development tools and libraries.
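To make the knowledge-base configuration concrete, the following is a minimal sketch of how such a restricted-vocabulary decoder can be set up with the pocketsphinx Python bindings, one distribution of the CMU Sphinx engine. The model and file paths are placeholder assumptions; the paper does not specify the authors' actual models or files.

```python
from pocketsphinx import Decoder

# Knowledge base: acoustic model, language model, and dictionary
# (paths are placeholders, not the authors' actual files).
config = Decoder.default_config()
config.set_string('-hmm', 'model/en-us')           # acoustic model
config.set_string('-lm', 'model/commands.lm')      # language model
config.set_string('-dict', 'model/commands.dict')  # pronunciation dictionary
config.set_float('-samprate', 16000)               # 16-kHz sampling, as in the SRU

decoder = Decoder(config)
decoder.start_utt()
with open('utterance.raw', 'rb') as f:             # 16-bit mono PCM audio (assumed)
    while True:
        buf = f.read(1024)
        if not buf:
            break
        decoder.process_raw(buf, False, False)     # stream the audio into the decoder
decoder.end_utt()

hyp = decoder.hyp()                                # most probable hypothesis
print(hyp.hypstr if hyp is not None else 'no command recognized')
```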
C. Recognition of Intentional Speech

One of the drawbacks of using voice control is that it is inappropriate for long-distance motion. To move an endoscopic camera a significant distance, the same command may have to be given repeatedly until the destination is reached. The natural human reaction is to prolong words, so instead of "left," the command might be "leeeeft," to indicate a longer motion. These two words sound the same to the human ear, but a computer registers a significant difference. The sounds of language are classified into what are termed phonemes. A phoneme is a minimal unit of sound that has semantic content; e.g., the phoneme AE versus the phoneme EH defines the difference between the sounds "a" and "e" in the words "bat" and "bet." In terms of speech decoding, the main difference between the two terms, "left" and "leeeeft," is in the length of the input region that is sampled during voice recording and the repeated phoneme EH.

An additional set of robot motion commands is required for ISR to produce a longer movement. Matching the execution time with the period of the pronounced speech, which gives real voice control, is the optimal solution to navigation over longer distances using speech recognition. This function would further decrease the ambiguity that can arise when there is uncertainty about the duration of robotic movement for each control word, because the surgeon defines the period using pronunciation. Different forms of this methodology are present in all manual controllers: while there is an input signal, the robot moves. The proposed system may be able to extract the necessary information by interpreting the intention of the speech command.

Fig. 4 shows the plot for the HMM as it searches for the most suitable phoneme sequence that is related to the input word. The process of speech recognition determines the best possible sequence of words that fits the given input speech. This can be simplified to the matching of a sequence of phonemes to a specific word. Instead of sampling and decoding the entire voice signal, the input signal is sampled in terms of short utterances, and phonemes are detected in real time (see Fig. 5). After sampling, the utterance is processed at the front end and the decoder and, finally, the phonemes are detected. A sequence of phonemes is the output.

At the next block, these phonemes are searched for the presence of a phoneme that starts a command. For example, the first received phoneme is "l," which corresponds to the speech command "left." If the next phoneme corresponds to the second phoneme of the word "left," namely EH, the system sends the corresponding movement command and continues to search new phonemes for the word ending. Repeated vowels do not affect the search. In the opposite case, if the phonemes do not match, a stop command is sent to the controller. Therefore, a robot's move signal is sent when the second phoneme matches the corresponding control word's phoneme. The robot's stop signal is transmitted when the last phoneme is decoded, if a phoneme does not match the control word, or if the recognition system reaches the end of the speech (silence is detected). Four control words are used for ISR: right, left, front, and back. Each word contains only one vowel that can be extended during pronunciation: "i," "e," "o," and "a." During the experiments using short and long pronunciation of these words, the vowel sounds that are pronounced when the words are short are different from the vowel sounds that are decoded when the words are prolonged. Namely, the vowel "e" in the first case is represented only by the EH phoneme, but in the second case, it is represented by a set of EH and IY phonemes. Similarly, the phoneme "a" is represented by a set of AE and AH phonemes, "o" is represented by a set of AH and OW phonemes, and "i" is represented by an AY phoneme. This change occurs because vowels are pronounced differently in the presence or absence of a consonant. This transformation of the phoneme means that the addition of alternative representations for vowel phonemes is a crucial factor in the development of the system. In terms of the design of the system, each of the vowels has two possible representations. During the experiments, the pronunciation of the consonants remained constant, so they have only one form of representation.
ZINCHENKO et al.: A STUDY ON SPEECH RECOGNITION CONTROL FOR A SURGICAL ROBOT 611

Fig. 4. HMM for the phonemes of the words "left" and "leeeft." If a dictionary contains words that can be distinguished by their very first phoneme, it is possible to immediately identify the control word, generate the appropriate robot command, and use the consecutive phonemes for confirmation only. At the same time, the duration of the remaining phonemes can indicate the duration of the robot's motion, allowing a wider movement range for the robot.

Fig. 5. ISR block diagram. To be able to dispatch the robot command while a person speaks, it is necessary to recognize the word from the first or second letter. Thus, phonemes are detected and mapped against the phonemes that compose the words in the dictionary.
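The phoneme-matching logic described above can be sketched as a small state machine. The code below is a minimal illustration, not the authors' implementation: the simplified phoneme spellings of the four control words, the send callback, and the class name are assumptions; only the vowel-variant sets come from the text.

```python
# Simplified phoneme sequences for the four ISR control words (assumed spellings).
COMMAND_PHONEMES = {
    'right': ['R', 'AY', 'T'],
    'left':  ['L', 'EH', 'F', 'T'],
    'front': ['F', 'R', 'AH', 'N', 'T'],
    'back':  ['B', 'AE', 'K'],
}

# Alternative vowel forms that appear when a word is prolonged (from the text).
VOWEL_VARIANTS = {
    'EH': {'EH', 'IY'},
    'AE': {'AE', 'AH'},
    'AH': {'AH', 'OW'},
    'AY': {'AY'},
}

class IntentionalSpeechMatcher:
    """Dispatches MOVE on the second matched phoneme and STOP on the last
    phoneme, a mismatch, or silence; repeated vowels keep the robot moving."""

    def __init__(self, send):
        self.send = send                           # callback, e.g., send('MOVE', 'left')
        self.word, self.pos, self.moving = None, 0, False

    def feed(self, ph):
        if self.word is None:                      # latch onto a command by its first phoneme
            for word, seq in COMMAND_PHONEMES.items():
                if ph == seq[0]:
                    self.word, self.pos = word, 0
                    return
            return
        seq = COMMAND_PHONEMES[self.word]
        cur = seq[self.pos]
        nxt = seq[self.pos + 1] if self.pos + 1 < len(seq) else None
        if nxt is not None and ph in VOWEL_VARIANTS.get(nxt, {nxt}):
            self.pos += 1
            if self.pos == 1 and not self.moving:
                self.send('MOVE', self.word)       # second phoneme verified: move
                self.moving = True
            if self.pos == len(seq) - 1:
                self._stop()                       # last phoneme decoded: stop
        elif ph in VOWEL_VARIANTS.get(cur, {cur}):
            pass                                   # prolonged/repeated vowel: keep moving
        else:
            self._stop()                           # mismatch: stop

    def silence(self):
        self._stop()                               # end of speech detected

    def _stop(self):
        if self.moving:
            self.send('STOP', self.word)
        self.word, self.pos, self.moving = None, 0, False
```

Feeding the phoneme stream L, EH, EH, IY, F, T to this matcher dispatches MOVE after EH and STOP at T, while the prolonged EH/IY phonemes in the middle simply keep the motion running, which is the intended ISR behavior.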

D. Serial-Port Communication

The PC is connected to the HIWIN MTG1000 robotic arm using a serial-connection protocol. The serial-connection architecture is shown in Fig. 6. First, the received decoded speech command is checked for validity against the set of available robot commands. The robot commands are divided into two groups: MOVE (right, left, front, back, in, and out) and STOP (stop). After validation, the control string is converted to bytes and written to the port. The robot's response is read from the port, decoded, and checked for validity. The robot outputs different strings for the MOVE and STOP commands. If a command is not executed properly, another control message is written to the port until the robot produces the desired output. When the command has been executed successfully, the serial port is ready to receive a new command from the speech command controller. The PC-to-robot communication uses the principle send-confirm, receive-send-confirm: a write to the port follows the reading and analysis of the robot's output message. In this study, the PySerial open-source library is used for the serial control block. PySerial is written in the Python programming language.

Fig. 6. Block diagram of the RS232 communication unit, which is responsible for the command consistency check and command flow control.
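As an illustration of the send-confirm cycle, the following PySerial sketch writes a one-character command and re-reads the port until the expected reply arrives. The port name, baud rate, and reply strings are assumptions; the HIWIN controller's actual output strings are proprietary and not given in the paper.

```python
import serial

# Serial link to the robot controller (port name and baud rate are assumed).
port = serial.Serial('/dev/ttyUSB0', baudrate=9600, timeout=0.5)

def send_confirm(key, expected):
    """Write a one-character command, then read the robot's reply; re-send
    until the reply contains the expected confirmation string."""
    while True:
        port.write(key.encode('ascii'))
        reply = port.readline().decode('ascii', errors='ignore').strip()
        if expected in reply:
            return reply        # command confirmed; ready for the next one

# Example: start a rightward motion, then stop it ('MOVE_OK'/'STOP_OK' are
# placeholders for the robot's actual, proprietary reply strings).
send_confirm('d', 'MOVE_OK')
send_confirm('n', 'STOP_OK')
```

Because the reply is read back before any new command is written, this loop also realizes the synchronization rule stated above: a new user command is not processed until the robot has responded.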

E. Development Environment

The voice control system is implemented on a PC with an Intel i7 processor, running Ubuntu LTS 14.02 OS and Python 2.7, using the PyCharm development environment. The CMU Sphinx open-source library is used for the speech recognition algorithm, and the open-source PySerial library is used for RS232 communication programming. The entire system is written in the Python language.

III. SPEECH-TO-MOTION CALIBRATION

During MIS, one assistant usually holds the endoscope and moves it in accordance with the surgeon's commands. During natural speech interaction, it is common to use ambiguous commands, such as "a little to the right" or "move more to the left." The assistant must understand the distance that the endoscope must be moved to achieve the desired view. Since the camera video is the only source of information about the situation inside a patient, a surgeon's commands relate to the observed image, rather than to the change in the endoscope's position.

Fig. 7. (a) Experimental setup for acquiring the calibration image; the endoscope lenses are located 5 cm from the RCM point. In this picture, the robot is in the home position. (b) Example camera image showing the calibration page during execution of the "right" motion command. The traveled distance was calculated between the intersection of the solid lines and the white triangle.

Fig. 8. Summary of the algorithm steps for the calibration of robot motion to the endoscope camera view.

For a robot, there is no ambiguity in the execution of commands. When a command is dispatched, the robot executes exactly the programmed amount of motion, in terms of the real-world position. Therefore, a surgeon who is operating with a robotic assistant must know how much image movement to expect in reaction to a voice command. In fact, the surgeon is more interested in the change in the view of the object on the screen than in the actual distance that the robot moves. For example, if an organ is observed in the image plane and the robot executes a "move right" command, the percentage of the organ that disappears when the action is completed can be predicted. To overcome the ambiguity in expectations about the image view, a method for the calibration of the robotic endoscope holder is proposed.

A. Precision Measurement of the Navigation System

The key calibration value, from the doctor's point of view, is the amount of change in the observed image. The calibration results are a quantitative value that describes the robot's behavior and a reference for tuning the robot for operations with special constraints on the robot's motion (in a confined space). To perform the measurement, the robot is positioned in the home position (a vertical and horizontal angle of 0°) and a calibration image is placed under the endoscope, such that the geometrical center of the endoscope image frame corresponds to the geometrical center of the calibration image. The endoscope's tip is placed at a distance of 5 cm from the robot's RCM point and 5 cm from the picture plane, which roughly corresponds to the position of the endoscope during an operation. Fig. 7(a) shows the experimental setup. The calibration image is divided into squares with sides of length 2 mm. When a speech command is sent to the robot, the robot navigates the endoscope to the new position. The shift in the observed object is estimated using the distance that is traveled from the geometrical center of the image frame. The accuracy is calculated in terms of the angular displacement of the endoscope. Given that the exact size of the squares in the calibration image is known, the reference calibration image displacement for this setup is calculated in millimeters. Fig. 7(b) shows samples of travel by the endoscope camera in the right direction with respect to the calibration image. The geometrical centers of the image are shown as solid white lines. In this way, it is possible to estimate the change that is observed in the object on the screen after the robot moves. A maximum of two commands in the same direction are used for measurement because there is distortion in the camera lenses. The robot is then returned to the zero position and the test is repeated.

To calculate the image displacement in millimeters, the test was repeated ten times and the results for each command were averaged. To estimate the change in the camera angle for each command, commands were repeated until the robot reached its motion limit from the home position in either direction. The overall range of the robot's movement is divided by the number of commands that are sent before the robot reaches its limit. Typically, the maximum robotic movement range is specified by the manufacturer. This test was repeated five times for each voice command and the data were averaged for each control sample.

B. Motion-to-Image Calibration Method

The steps of the method are summarized in Fig. 8. The method comprises a set of tests that must be performed to observe and calibrate the performance of the system, in terms of the change in the image for each movement of the robot. The results allow proper calibration between speech and the image that is provided by the camera. The primary goal of calibration is to define the best pattern of robot motion for the existing configuration of the endoscope, robotic holder, and operation space. This calibration method can also be used in industrial applications that require control or adjustment of a robot using video information. The range of motion for industrial robots is typically greater than the range that is required for surgical tasks, so the calibration image must be adjusted accordingly.
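The two quantities produced by this procedure can be computed as follows. This is a minimal sketch, assuming hypothetical measured values; the function names and the example numbers are illustrative, not the paper's data.

```python
def angular_step(total_range_deg, commands_to_limit):
    """Average angular displacement per command: the robot's overall
    movement range divided by the number of commands needed to reach
    the limit from the home position."""
    return total_range_deg / commands_to_limit

def image_step_mm(shift_squares, square_side_mm=2.0):
    """Displacement of the calibration image in millimeters: the shift
    from the frame center counted in grid squares, where each square
    of the calibration image has a 2-mm side."""
    return shift_squares * square_side_mm

# Hypothetical example: the 110-deg horizontal range reached in 14 commands,
# and an observed shift of 1.25 squares on the calibration grid.
print(angular_step(110.0, 14))   # about 7.9 deg per command
print(image_step_mm(1.25))       # 2.5 mm on the image plane
```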

TABLE II
CALIBRATION MEASUREMENT RESULTS

                  Endoscope angular displacement   Corresponding calibration image
                  precision (deg)                  displacement (mm)
Speech Command    Pedal Control   Voice Control    Pedal Control   Voice Control
Right             12.3            7.7              4               2.5
Left              12.3            7.7              4               2.5
Back              3.9             3.1              2.3             1.8
Forward           3.9             3.1              2.3             1.8

TABLE III
LATENCY RESULTS OF ISR (s)

                  Utterance Duration
Speech Command    0.5 s    0.8 s    1.8 s
Right             1.58     1.62     3.34
Left              1.19     1.86     3.44
Back              1.55     1.69     3.22
Front             1.19     1.67     3.34

IV. EXPERIMENTAL RESULTS

The SRU lies at the heart of the proposed system, so it is crucial that it is reliable. The first experiment determined the success rate for the SRU that is shown in Fig. 3. A person commanded the robot to move, using a headset, and the response from the SRU was recorded. Command strings were classified as a "success" or a "fail." Every command was repeated ten times by the same person. The average recognition rate was 90%, with a standard deviation of 5%. The processing time for the SRU was 0.27 s once the sampled utterance had been input to the system. A series of tests then determined the robot's navigation precision using the SRU, including a comparative test using the foot pedal. Two experiments were then performed using a separate ISR unit. The test dataset included 120 voice commands: ten for each command (right, left, front, and back), and in each case the word was uttered for a different length of time (0.5, 0.8, and 1.8 s). In the first experiment, the ISR latency was calculated with respect to the different lengths of the commands. In the second experiment, the change in the object on the image plane was measured for speech commands of different lengths. The system is designed to move the endoscope farther if the command is longer.

A. Results for System Navigation Precision

Table II shows the calibration measurement results for the proposed robotic system. In the experiment, the best navigation precision that the proposed system achieves is 3.1° for the forward and backward tasks. It is seen that there is a difference in the precision of the navigation system for the horizontal and vertical movement commands. This is due to the motor arrangement for the endoscope holder and its range of movement. The range of horizontal movement is 110° and the vertical range is 60°. A larger range of movement requires more significant moves per step.

The displacement of the calibration image shows the change in an object on the image plane that is located at a distance of 5 cm from the endoscope tip when the speech command is executed. For example, if an apple is being observed and a "right" movement command is executed, only half of the apple might be visible when the movement is completed. In the case of a human heart, this limited view might be a problem during an operation. The distances in Table II show that after the "right" movement, the apple moves only 2 mm to the right (this is the smallest distance that the current system can achieve, and different values can be used). Therefore, a surgeon can change the image view without suddenly losing half of the apple. The change in the calibration image depends on the angular displacement of the endoscope.

B. Pedal Navigation Experiment Results

To verify the voice-controlled system, the precision experiment was repeated using the foot pedal controller that was provided with the robotic system. To ensure an accurate comparison of the distances, the equipment and the calibration image that were used were the same as those used for the system's navigation test. The joystick pedal was depressed in the desired direction until the robot performed the motion and was then immediately released. The images that were acquired are shown in Fig. 7(b). The relative distance that was traveled from the starting point was then calculated. To determine the precision in degrees and the displacement of the calibration image for each single command, the methodology of the first experiment was used. The results of this test are summarized in Table II. It is seen that the overall system performance is consistent with expectations for a voice-controlled robotic assistant. There is no significant difference between the precision of navigation using speech control and pedal control, which demonstrates that voice-controlled systems perform sufficiently well for practical use.

C. ISR Latency and Precision

A practical system must execute commands within a reasonable time. This experiment calculated the latency of the ISR system. For the latency measurements, the overall execution time was measured. The processing time includes both the phoneme decoding and the robot motion execution time. Each of the four commands was spoken ten times, and three different lengths of command were used: 0.5, 0.8, and 1.8 s. This experiment quantitatively determined the maximum command transmission speed, using RS232 serial-port communication and the intentional SRU that is implemented using CMU Sphinx. The controller for the robotic endoscope holder is proprietary to HIWIN and its response speed cannot be altered. Table III shows the measured processing time for each command word. The robot starts to move immediately after the first and second phonemes are verified and stops when the last phoneme is processed. The relative navigation distance depends on the length of the pronounced command word, as shown in Fig. 9. For the shortest command length, 0.5 s, the image on the screen shifts 2.5 mm, and for the longest command length, 1.8 s, the displacement of the image is 6 mm, for the given experimental setup.
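For reference, the end-to-end latency of Table III can be obtained with a simple timer around the decoding and motion stages. In this sketch, recognize and execute are hypothetical stand-ins for the SRU and the serial-communication unit, not the authors' functions.

```python
import time

def measure_latency(recognize, execute):
    """End-to-end processing time: phoneme decoding plus robot motion
    execution, mirroring the measurement methodology of Table III."""
    start = time.time()
    word = recognize()   # blocks until the spoken command is decoded
    execute(word)        # blocks until the robot confirms the motion
    return time.time() - start
```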

Fig. 9. Navigation distance of the intentional SRU under the proposed experimental setup, with respect to the length of the pronounced word. The calculated results will vary with a change of setup.

V. DISCUSSION

A major concern for speech control is the intermittent motion that it produces, which makes navigation over a long distance a tedious task. This is mainly due to ambiguity in the speech control method that pertains to the duration of the robot's movement for each control word. Intentional speech that considers the length of the command eliminates the necessity for a large vocabulary and makes speech control more natural. Using an intention-based robotic control arm, a surgeon can adjust the distance that the robot moves using only voice commands.

This paper proposes a calibration method for the robot that estimates the change in an image for a given voice command. The results are given both in degrees of angular displacement for the endoscope and as the change in the corresponding calibration image in millimeters. The displacement in the calibration image depends on the endoscope's position and orientation. When the robot's angular velocity is constant, the displacement in degrees is the most sustainable indicator of the precision of the robot's movement. However, a surgeon could prefer voice-to-motion matching not in terms of angular displacement, but in terms of linear displacement on the image plane, which would require adapting the robot commands as a function of the orientation of the endoscope. The calibration procedure makes it possible to link the image displacement with the robot's angular displacement for different insertion depths of the endoscope. If the robot knows the depth of insertion, the adapted robot commands relate the amount of change in the image plane for different camera-object orientations.

Although the method and the system are designed for surgical tasks, they are not limited to this field. Intentional speech control is useful in many human–robot interactions. For example, using a different length of command, a worker could temporarily slow a robot to check on its working process.

VI. CONCLUSION

Of the different robotic control strategies, voice control is the most convenient because it does not require the use of limbs. Speech recognition liberates the assistant who is responsible for handling an endoscope. This study develops a method for the voice control of an endoscope holder, in terms of the intuitiveness of the interface from the surgeon's point of view. The primary goal of the proposed methodology is to show the necessity and usefulness of motion calibration in terms of the relative change in the image for a voice command. The intuitiveness of the system depends on the ISR engine, which maps the length of the control word to the distance that the endoscope camera navigates.

REFERENCES

[1] F. Zinzindohoue et al., "Laparoscopic gastric banding: A minimally invasive surgical treatment of morbid obesity: Prospective study of 500 consecutive patients," Ann. Surg., vol. 237, no. 1, pp. 1–9, 2003.
[2] T. P. Cundy et al., "The first decade of robotic surgery in children," J. Pediatr. Surg., vol. 48, no. 4, pp. 858–865, 2013.
[3] C. Freschi, V. Ferrari, F. Melfi, M. Ferrari, F. Mosca, and A. Cuschieri, "Technical review of the Da Vinci surgical telemanipulator," Int. J. Med. Robot. Comput. Assisted Surg., vol. 9, pp. 396–406, 2013.
[4] L. S. G. L. Wauben et al., "Application of ergonomic guidelines during minimally invasive surgery: A questionnaire survey of 284 surgeons," Surg. Endosc. Other Interventional Techn., vol. 20, no. 8, pp. 1268–1274, 2006.
[5] J. M. Gilbert, "The EndoAssist robotic camera holder as an aid to the introduction of laparoscopic colorectal surgery," Ann. Roy. College Surgeons Engl., vol. 91, no. 5, pp. 389–393, 2009.
[6] M. Quigley et al., "Semi-autonomous human-UAV interfaces for fixed-wing mini-UAVs," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Sendai, Japan, 2004, pp. 2457–2462.
[7] S. Li, J. Zhang, L. Xue, F. J. Kim, and X. Zhang, "Attention-aware robotic laparoscope for human-robot cooperative surgery," in Proc. IEEE Int. Conf. Robot. Biomimetics, Shenzhen, China, 2013, pp. 792–797.
[8] A. Nishikawa et al., "FAce MOUSe: A novel human-machine interface for controlling the position of a laparoscope," IEEE Trans. Robot. Autom., vol. 19, no. 5, pp. 825–841, Oct. 2003.
[9] L. Barkhuus and V. E. Polichar, "Empowerment through seamfulness: Smartphones in everyday life," Pers. Ubiquitous Comput., vol. 15, no. 6, pp. 629–639, 2011.
[10] K. Bojan et al., "Mobile robot controlled by voice," in Proc. Int. Symp. Intell. Syst. Informat., 2007, pp. 189–192.
[11] Y. Lu et al., "Voice-based control for humanoid teleoperation," in Proc. Int. Conf. Intell. Syst. Des. Eng. Appl., Oct. 2010, pp. 814–818.
[12] M. H. Draper et al., "Multi-unmanned aerial vehicle systems control via flexible levels of interaction: An adaptable operator-automation interface concept demonstration," in Proc. Infotech Aerosp. Conf., vol. 1, 2013, pp. 691–715.
[13] J. Pires, Industrial Robots Programming: Building Applications for the Factories of the Future. New York, NY, USA: Springer, 2006.
[14] S. Profanter, A. Perzylo, N. Somani, M. Rickert, and A. Knoll, "Analysis and semantic modeling of modality preferences in industrial human-robot interaction," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2015, pp. 1812–1818.
[15] A. Rogowski, "Industrially oriented voice control system," Robot. Comput.-Integr. Manuf., vol. 28, no. 3, pp. 303–315, 2012.
[16] Nathan et al., "The voice-controlled robotic assist scope holder AESOP for the endoscopic approach to the Sella," Skull Base, vol. 16, no. 3, pp. 123–131, 2006.
[17] J. Kim, Y.-J. Lee, S.-Y. Ko, D.-S. Kwon, and W.-J. Lee, "Compact camera assistant robot for minimally invasive surgery: KaLAR," in Proc. Int. Conf. Intell. Robots Syst., 2004, pp. 2587–2592.

[18] P. Berkelman, E. Boidard, P. Cinquin, and J. Troccaz, "LER: The light endoscope robot," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Las Vegas, NV, USA, Oct. 2003, pp. 2835–2840.
[19] ViKY Uterine Positioner. (2016). [Online]. Available: http://www.endocontrol-medical.com/press_release.php
[20] A. A. Gumbs, F. Crovari, C. Vidal, P. Henri, and B. Gayet, "Modified robotic lightweight endoscope (ViKY) validation in vivo in a porcine model," Surg. Innov., vol. 14, pp. 261–264, 2007.
[21] A. Perrakis et al., "Integrated operation systems and voice recognition in minimally invasive surgery: Comparison of two systems," Surg. Endosc., vol. 27, no. 2, pp. 575–579, 2013.
[22] CMU Sphinx. (2016). [Online]. Available: http://cmusphinx.sourceforge.net/
[23] PySerial. (2016). [Online]. Available: http://pyserial.sourceforge.net/

Kateryna Zinchenko (S'16) was born in Kyiv, Ukraine, in 1991. She received the B.S. degree in electronics engineering from National Chiao Tung University, Hsinchu, Taiwan, in 2013, where she is currently working toward the Ph.D. degree in electrical engineering and computer science.
Her research interests include surgical robots, swarm robotics, shared control, artificial intelligence, and VR.
Ms. Zinchenko received the Best Student Paper Nomination (CACS 2014, Taiwan), the System and Architecture Talent Training Program Award (MOE 2014, Taiwan), and the Best Student Paper Award (ICCAS 2015, Korea).

Chien-Yu Wu received the M.S. degree in electrical control engineering from National Chiao Tung University, Hsinchu, Taiwan, in 2015.
He had an internship with the Institute of Mechatronic Systems, Leibniz University, Hannover, Germany. His research interests include surgical continuum robotics and speech recognition.

Kai-Tai Song (A'91–M'09) received the B.S. degree in power mechanical engineering from National Tsing Hua University, Hsinchu, Taiwan, in 1979, and the Ph.D. degree in mechanical engineering from the Katholieke Universiteit Leuven, Leuven, Belgium, in 1989.
Since 1989, he has been with the faculty and is currently a Professor with the Institute of Electrical Control Engineering, National Chiao Tung University, Hsinchu. His current research interests include mobile robots, image processing, visual tracking, and mobile manipulation.
Dr. Song received the Excellent Automatic Control Engineering Award of the Chinese Automatic Control Society (CACS) and the Engineering Paper Award of the Chinese Institute of Engineers. He received the best paper awards of the IEEE ICSSE 2016, the IEEE ICAL 2012, and CACS 2013 and 2014. He coached the NCTU Robotics team, which won first place in the University Challenge at the World Robot Olympiad, Qatar, in 2015. He is a Fellow of CACS and currently serves as the President of CACS, Taiwan.
