You are on page 1of 13

MR-based synthetic CT generation using a deep convolutional neural network

method
Xiao Hana)
Elekta Inc., Maryland Heights, MO 63043, USA
(Received 18 October 2016; revised 31 January 2017; accepted for publication 5 February 2017;
published 21 March 2017)
Purpose: Interests have been rapidly growing in the field of radiotherapy to replace CT with mag-
netic resonance imaging (MRI), due to superior soft tissue contrast offered by MRI and the desire to
reduce unnecessary radiation dose. MR-only radiotherapy also simplifies clinical workflow and
avoids uncertainties in aligning MR with CT. Methods, however, are needed to derive CT-equivalent
representations, often known as synthetic CT (sCT), from patient MR images for dose calculation
and DRR-based patient positioning. Synthetic CT estimation is also important for PET attenuation
correction in hybrid PET-MR systems. We propose in this work a novel deep convolutional neural
network (DCNN) method for sCT generation and evaluate its performance on a set of brain tumor
patient images.
Methods: The proposed method builds upon recent developments of deep learning and convolu-
tional neural networks in the computer vision literature. The proposed DCNN model has 27 convolu-
tional layers interleaved with pooling and unpooling layers and 35 million free parameters, which can
be trained to learn a direct end-to-end mapping from MR images to their corresponding CTs. Train-
ing such a large model on our limited data is made possible through the principle of transfer learning
and by initializing model weights from a pretrained model. Eighteen brain tumor patients with both
CT and T1-weighted MR images are used as experimental data and a sixfold cross-validation study is
performed. Each sCT generated is compared against the real CT image of the same patient on a
voxel-by-voxel basis. Comparison is also made with respect to an atlas-based approach that involves
deformable atlas registration and patch-based atlas fusion.
Results: The proposed DCNN method produced a mean absolute error (MAE) below 85 HU for 13
of the 18 test subjects. The overall average MAE was 84.8  17.3 HU for all subjects, which was
found to be significantly better than the average MAE of 94.5  17.8 HU for the atlas-based method.
The DCNN method also provided significantly better accuracy when being evaluated using two other
metrics: the mean squared error (188.6  33.7 versus 198.3  33.0) and the Pearson correlation
coefficient(0.906  0.03 versus 0.896  0.03). Although training a DCNN model can be slow,
training only need be done once. Applying a trained model to generate a complete sCT volume for
each new patient MR image only took 9 s, which was much faster than the atlas-based approach.
Conclusions: A DCNN model method was developed, and shown to be able to produce highly accu-
rate sCT estimations from conventional, single-sequence MR images in near real time. Quantitative
results also showed that the proposed method competed favorably with an atlas-based method, in
terms of both accuracy and computation speed at test time. Further validation on dose computation
accuracy and on a larger patient cohort is warranted. Extensions of the method are also possible to
further improve accuracy or to handle multi-sequence MR images. © 2017 American Association of
Physicists in Medicine [https://doi.org/10.1002/mp.12155]

Key words: convolutional neural network, deep learning, MRI, radiation therapy, synthetic CT

1. INTRODUCTION ionizing radiation and offers superior soft tissue contrast that
allows more accurate target and structure delineation. A com-
Computed Tomography (CT) imaging has been traditionally plete workflow based solely on MRI can further eliminate
used as the primary source of image data in the planning pro- image registration uncertainties for combining MRI with CT
cess of external radiation therapy. CT images offer accurate and reduce clinical workload. The main challenges in replac-
representation of patient geometry, and CT values can be ing CT with MRI are that the MRI intensity values are not
directly converted to electron densities for radiation dose cal- directly related to electron densities and conventional MRI
culation. On the other hand, CT images have limited soft tis- sequences cannot obtain signal from bone. It is therefore
sue contrast and result in extra radiation to the patients. desirable to have a method that can derive CT-equivalent
In recent years, interests in replacing CT with magnetic information from MR images for the purposes of dose calcu-
resonance imaging (MRI) in the treatment planning process lation during treatment planning and DRR-based patient
have grown rapidly.1,2 This is mainly because MRI is free of setup. Such MR-based CT-equivalent data are often referred

1408 Med. Phys. 44 (4), April 2017 0094-2405/2017/44(4)/1408/12 © 2017 American Association of Physicists in Medicine 1408
1409 Han: Deep learning synthetic CT generation 1409

to as pseudo-CT or synthetic CT (sCT). In addition to MR- In this work, we propose a novel learning-based approach
only radiotherapy, generating accurate sCTs from MR images for sCT generation that exploits recent developments in deep
is also important for PET attenuation correction in hybrid learning and convolutional neural networks (CNNs).
PET/MRI systems.3 It should be noted that the inferior geo- CNNs35–37 are multi-layer, fully trainable models that can
metric fidelity of MRI can still be an issue when applying capture and represent complex, high-dimensional input–out-
sCT for these applications, since the sCT generated will put relationships. They have become the method of choice in
inherit any geometric distortion that exists in the source MR many fields of computer vision, and also been applied for
image. computer-aided detection and medical image segmentation
Many different methods have been proposed in the litera- problems.38–41 One appealing property of CNNs is their abil-
ture for automatic generation of sCTs from MR images. ity to learn complex models end-to-end instead of hand-craft-
They can be roughly divided into three categories: tissue- ing individual components (such as what image features to
segmentation-based, learning-based, and atlas-based compute and how to map the features to the desired outputs).
approaches. Tissue segmentation approaches first segment We thus apply CNN to learn a direct image-to-image map-
or classify MR image voxels into a discrete set of tissue ping between MR images and their corresponding CTs. This
types, such as air, fat, soft tissue, and bone, and then use is different from all existing learning-based sCT methods,
bulk density assignments to assign a different CT number which require hand-designed features and predict CT values
for each tissue type.4–11 As pointed out in Hsu et al.,8 tissue voxel-by-voxel. To our knowledge, this is the first time that a
segmentation is not an easy problem and one MR deep CNN method is developed for sCT generation. It should
image volume is usually insufficient to separate all major be noted, however, that a non-convolutional neural network
tissue types. In addition, conventional MR sequences can- method was proposed recently for cross-modality image syn-
not reliably differentiate bone from air. Consequently, thesis.42 But unlike what is proposed here, the method was
most tissue segmentation methods require the use of multiple patch-based and need be applied in a sliding window fashion
MR sequences, especially the specialized ultrashort echo to generate image synthesis voxel-by-voxel. A small image
time (UTE) sequences. This, however, may lead to longer patch (3 9 3 9 3 in Nguyen et al.42) cannot provide suffi-
image acquisition time and more complicated workflow. cient image context and also limits the model capacity.
Existing learning-based approaches employ statistical Hence, the authors42 found it necessary to incorporate spatial
learning or model fitting techniques to construct a mapping coordinates as extra features, which would require all images
function that associates MR voxel intensities (or intensity pat- to be registered or spatially normalized.
terns) with corresponding CT numbers.12–19 Similar to the Due to limited training data and computer memory restric-
case of tissue segmentation, using conventional MR images tions, the CNN model in this study is designed for 2D, in the
alone is insufficient to make reliable predictions. Some meth- sense that it takes a 2D MR slice as input and outputs a corre-
ods require manual delineation of the bone volume so that sponding 2D sCT slice. A 3D MR image can be processed
separate models could be built for different regions.13,14 slice-by-slice to generate a corresponding 3D sCT volume.
Alternatively, multiple sequences, especially UTE sequences, This 2D procedure is still much more efficient than voxel-by-
are often used to help distinguish bone voxels from air.16,17 In voxel predictions, since a typical 3D image can have millions
addition to MR intensity values, various local intensity pat- of voxels but only a couple hundred slices at most. Our exper-
tern and coordinate-related features were also designed and iments showed that a full 2D slice already provides sufficient
used as extra inputs to improve the prediction accuracy of a contextual information for the model to reliably separate
sCT model.15,17,18 locally similar voxels and produce very accurate sCT results.
Atlas-based approaches apply image registration to The method can, however, be extended in the future to take
align a target MR image to an atlas MR, the correspon- multiple slices as input or directly process a 3D volume once
dence information can then be used to warp the associated data and computer resources become available.
atlas CT image to the target MR to generate a sCT.20–33
Such approaches have become very popular due to their
potential of producing reliable sCT estimations using 2. MATERIALS AND METHODS
conventional MR images, as demonstrated in the litera-
2.A. Data acquisition and image preprocessing
ture.20–33 Pure atlas-based methods require accurate
deformable image registration of atlas and patient MR The reported study was performed on real image data from
images. It, however, can be difficult to accurately register 18 patients, which were randomly selected from a collection
a patient image with an arbitrary atlas, especially when of patients having undergone stereotactic radiosurgery with
there are large anatomical variations or pathological differ- the Leksell Gamma Knifeâ. The MR image for each patient
ences. These difficulties can be partially overcome through was acquired with a 1.5 T Siemens Avanto scanner using a
the use of multiple atlases and the design of atlas fusion T1-weighted 3D spoiled gradient recalled echo sequence
methods that are robust to atlas registration errors (such as without contrast agent (echo time 4.6 ms, repetition time
patch-based fusion and sparse coding).20–22,24,29,30,32 11 ms, flip angle 20 , voxel size 1 9 1 9 1 mm3, field-of-
Hybrid methods have also been proposed that combine view 256 9 256 9 160 mm3, total scan time 4 min). The
atlas-based methods with model-based prediction.33,34 corresponding CT image was acquired on a Siemens

Medical Physics, 44 (4), April 2017


1410 Han: Deep learning synthetic CT generation 1410

(a) (b) (c) (d)

FIG. 1. Illustration of data preprocessing. A 2D axial slice is shown in each subfigure for: (a) the MR image; (b) the computed head mask; (c) the original CT
image; (d) the CT image with the stereotactic head frame masked out.

Sensation 16 scanner with tube voltage 120 kV, exposure function to convert a 2D MR slice to its corresponding 2D
300 mAs, in-plane resolution 0.59 0.5 mm2, and slice thick- CT. The model can be trained by collecting all 2D MR slices
ness of 1 mm. with corresponding 2D CT slices from each training subject’s
Each patient’s MR/CT image pair was aligned rigidly 3D MR/CT pair. Once the model is trained, it can be applied
using a mutual information rigid registration algorithm, and on a new MR image slice-by-slice and the results can be
the CT was resampled to match the resolution and field of assembled to get the final 3D sCT. Directly training a full-3D
view of the MR image. Each MR image was further corrected DCNN model is infeasible due to limitations in GPU memory
for intensity non-uniformity using the N3 bias field correc- of commodity GPU cards and due to limited training data in
tion algorithm.43 All MR images were then histogram- this study. It may also be unnecessary since a 2D slice already
matched to a randomly chosen template to help standardize contains rich contextual information.
image intensities across different patients using the method Many different CNN models have been proposed in the
of Cox et al.44 computer vision literature, and their architectures can be very
To better evaluate the accuracy of sCT estimation, a binary flexible. In this work, we build upon recent developments in
head mask was automatically derived from each MR image semantic image segmentation where a deep CNN model can
to separate the head region from the non-anatomical, back- be trained from end-to-end to directly produce a dense label
ground region of the image. This was achieved by applying map for object segmentation in a 2D image.39,46–48 In partic-
the Otsu auto-thresholding method45 on each MR image. A ular, we adopt and modify from the U-net architecture that
morphological closing operator was employed to fill in gaps was proposed in Ronneberger et al.,39 and the resulting net-
around the nasal cavities and the ear canals, the largest con- work architecture is shown in Fig. 2.
nected component of which then produced the head mask.
Figs. 1(a) and 1(b) show one example MR image and the
head mask computed.
It should be noted that every patient had a stereotactic head
frame that was used in conjunction with Gamma Knifeâ treat-
ment. The head frame is totally invisible in the MR image but
only present in the CT image. Its position also differs in each
patient. It is infeasible to expect that a model based on the MR
image alone can reliably predict the head frame. To avoid any
adverse impact of the head frame on model training, we artifi-
cially removed the head frame from each CT image by setting
all voxels outside the previously computed head mask region
to a HU of 1000. Figs. 1(c) and 1(d) show axial views of a
CT image before and after the head frame was removed. The
head frame caused streaking artifacts in the CT image of every
patient, as can be seen in these figures. It should be noted that
these random artifacts in the “ground truth” CT images inevi-
tably cause some over-estimation of the error when evaluating FIG. 2. Overall architecture of the proposed sCT DCNN model. Each blue
the accuracy of the predicted sCTs. box represents a (3 9 3) convolutional layer (with a rectified linear unit as
the activation function). Each red box denotes a max-pooling layer, and each
purple box denotes an un-pooling layer. Each white box denotes a copying
2.B. Deep CNN (DCNN) model for sCT estimation layer. The 2D image size and the depth (number of channels) of the feature
map from each convolution layer are provided at the top of each blue box.
As mentioned in the introduction section, we design a 2D The green box denotes the final 1 9 1 convolution layer that generates the
DCNN model in this work to directly learn a mapping output sCT prediction.

Medical Physics, 44 (4), April 2017


1411 Han: Deep learning synthetic CT generation 1411

Similar to Ronneberger et al.39 (see also Noh et al.47 and The max(0,) operation that is applied element-wise on the
Badrinarayanan et al.48), the model can be seen as consisting convolution output corresponds to the use of a Rectified Lin-
of two main parts: an encoding part (left half) and a decoding ear Unit (ReLU) as the nonlinear activation function. For the
part (right half). The encoding part behaves as traditional first convolutional layer, the input X is the MR image itself.
CNNs that learn to extract a hierarchy of increasingly com- For subsequent layers, the input comes from the output fea-
plex features from an input MR image. The decoding part ture map of the previous layer.
transforms the features and gradually reconstructs the sCT The number of filters in each convolution layer is predeter-
prediction from low to high resolution. The final output of mined, whereas the weights and the biases Wk0 s and bk0 s are
the network is a 2D image with the same size as the input the free parameters. A key to the success of CNNs is their
image. A key innovation from Ronneberger et al.39 that we ability to learn the weights and biases of individual feature
borrow here is to introduce direct connections (shown as maps to get data-driven, task-specific feature extractors
white arrows in Fig. 2) across the encoding part and the instead of relying on a fixed set of hand-crafted features. Fol-
decoding part so that high resolution features from the first lowing the VGG model, we use 3 9 3 filters for all convolu-
part can be used as extra inputs for the convolutional layers in tional layers in the encoding path. It should be noted that the
the second part. This design makes it easier for the decoding size 3 9 3 only refers to the spatial size of each convolution
part to generate high resolution predictions. These short-cuts filter. Each weight is a vector that has the same number of
also make the model more flexible. For example, the model components as the number of channels in the input. Suppose
can automatically learn to skip coarse level features (at bot- there are R input channels, the total number of free parame-
tom of the network) if high resolution features (at top of the ters in a convolution layer is thus equal to K 9 3 9
network) are sufficient to produce accurate CT predictions. 3 9 R + K. We apply zero-padding when performing the
Different from Ronneberger et al.,39 we design the encod- convolution so that the output size (spatial) stays the same as
ing part to follow the same architecture as the popular VGG the input. The 2D size and the number of channels of the out-
16-layer net model49 that was proposed for image classifica- put feature map of each convolutional layer are shown in
tion. We keep the first 13 convolution layers but remove the Fig. 2 on top of the blue boxes.
three-fully connected layers. Removing the fully connected It should be clear from Fig. 2 that a lot of computer mem-
layers reduces the number of parameters by 90% (from 134 ory is needed to store the feature maps, since each feature
million to 14.7 million), which makes the final model easier map has tens or hundreds of channels and each channel has a
to train. Besides, the fully connected layers correspond to size directly proportional to the input image size. The mem-
global image features that are critical for image classification ory requirement is one limiting factor that prevents directly
tasks but not very relevant for dense pixel-wise prediction. building a 3D DCNN model, especially since the number of
By duplicating an existing architecture, we can initialize the filters or channels also needs to be increased to extract fea-
feature extraction part of our model by copying existing VGG tures in 3D. In addition, a 3D DCNN model will have more
model weights that were trained on a very large set of non- parameters and hence require more training data.
medical image data (i.e., 1.3 million natural images consisting After performing 2 to 3 convolutional layers, a max-pool-
of 1000 different object categories as explained in Simonyan ing operation layer is applied to reduce the spatial resolution
et al.49). Although the VGG model was trained on totally dif- of the feature maps so that consequent convolutional layers
ferent images, low- to mid-level features learned from a CNN can learn features of larger image context. The pooling opera-
model are quite generic and initializing from a well-trained tion also helps to generate features that are invariant to local
model is an effective way to help get better solutions for a perturbations of input images. The VGG model uses max-
new task with limited training data as we face here. In the pooling with a 2 9 2 window and stride 2 (non-overlapping
next, we explain individual components of the network in window). The operation consists of taking the maximum fea-
Fig. 2 in more details. ture value over non-overlapping sub-windows of size 2 9 2
The encoding part applies a series of convolutional and within each feature channel. This can be formalized as
pooling layers to extract a hierarchy of features. The convolu- follows:
tional layer is the core building block of a CNN model. Each
convolutional layer performs 2D convolutions of its input zk;i;j ¼ maxp;q2½0;1Š2 hk;2iþp;2jþq ;
with a set of filters, the results of which are passed through a
nonlinear activation function. Mathematically, the operation where (i, j) denotes the spatial index of the output feature
can be expressed as follows: map and k is the channel index. The max-pooling operation
shrinks the size of the feature map by half in each spatial
hk ¼ maxð0; Wk  X þ bk Þ; k 2 ½0; K 1Š;
dimension. This is clearly seen in Fig. 2. It should be noted
where Wk and bk represent the weights and bias of the k-th fil- that there are no free parameters to learn for the max-pooling
ter, respectively, and ‘*’ denotes the convolution operation. layer. But the locations where the maximum value is achieved
The subscript k, k 2 [0, K 1], denotes the index of the fil- in each pooling window need to be saved for later unpooling
ter in the set of total K filters. X denotes the input, and hk is operation.50 This information can be stored as a binary mask
the output of the k-th filter, which is also known as the k-th with the same size as the input feature map, as illustrated in
channel of the output feature map of the convolutional layer. Fig. 3.

Medical Physics, 44 (4), April 2017


1412 Han: Deep learning synthetic CT generation 1412

corresponding CT images {Yi}, we use the Mean Absolute


Error (MAE) as the loss function:
1 XN
LðhÞ ¼ jjYi FðXi ; hÞjj;
N i¼1

where N is the number of training images (2D in this study).


Using MAE (l1-norm) as the loss function makes the learning
more robust to outliers in the training data, such as noise or
other artifacts in the images or due to imperfect matching
between MR and CT images.

2.C. Model implementation details


The proposed pseudo-CT DCNN model is implemented
FIG. 3. Illustration of pooling and unpooling operations. The pooling layer using the publicly available Caffe package51 with a newly
reduces the input size by half in each spatial dimension, and the unpooling
layer doubles the input size. The pooling mask saved during the pooling oper-
added input layer in order to handle 16-bit image data. The
ation is used to guide the unpooling computation. [Color figure can be MAE loss function is also implemented. All computations
viewed at wileyonlinelibrary.com] are performed on GPU.
The DCNN model is trained using backpropagation52
The decoding part of our DCNN model is chosen to be a with the Adam stochastic optimization method53 as imple-
mirrored version of the encoding part as similar to Ron- mented in Caffe, which offers faster convergence than stan-
neberger et al.39 and Noh et al.,47 with an extra 1 9 1 convo- dard stochastic gradient descent methods. The stochastic
lutional layer added in the end that is used to map each optimization method randomly selects a subset of training
64-component feature vector from the previous layer to a CT samples (each sample consists of a MR slice and the corre-
number. Contrary to the feature extraction part that uses max- sponding CT slice) at each iteration of the optimization,
pooling layers to gradually reduce the spatial resolution of which is known as a mini-batch. Larger mini-batches are
the feature maps, the decoding part applies unpooling layers often preferred in order to reduce variances in gradient esti-
to propagate information from coarser to fine resolution. The mation and to fully leverage the parallel computing power
unpooling layer can be considered as performing a reverse of the GPU. But the mini-batch size is also limited by the
operation of the corresponding max-pooling layer. It doubles available GPU memory. We use an effective mini-batch size
the spatial size of its input feature map, and copies the feature of 48 for model training, as permitted by the GPU card
value directly from the input to the output at the maximal used in this study. As mentioned earlier, the model parame-
position for each pooling window using the recorded binary ters for the encoder part of the network are initialized using
mask as illustrated in Fig. 3. The rest of the output is filled corresponding weights from a pretrained 16-layer VGG
with zeros. model as available from the Caffe ModelZoo repository.* It
The output of an unpooling layer is very sparse. The sub- is noted that the first layer of the VGG model expects 3-
sequent convolutional layers then transform the sparse output channel color images as input. We simply sum the VGG
into denser representations through convolution operations. weights over the color channels to get the initial weights for
Note that each of these convolutional layers in the decoding our model for this layer. The weights in the decoding part
part has its own set of learnable parameters that are indepen- of the network are initialized randomly with zero-mean
dent of the weights in the encoding part. As suggested in Gaussian distribution N(0, 1e 4),52 and the bias parameters
Ronneberger et al.,39 to make it easier to reconstruct image are initialized to all zeros. To accelerate training, batch nor-
details, extra connections are made that copy over high reso- malization is performed after each convolutional layer to
lution features from the encoding part and combine them reduce internal covariant shift.54 Simple data augmenta-
with the unpooling output before feeding to the subsequent tion36 is also performed to artificially increase the number
convolutional layers, as shown Fig. 2. Putting the encoding of training data during model training, which applies a ran-
and decoding parts together, the complete network has dom translation of up to 20 pixels in each spatial dimension
27 convolutional layers in total and about 34.9 million for each pair of MR and CT images or randomly flips the
parameters. images.
The full network can be considered as representing a com-
plex end-to-end mapping function that transforms an input
2.D. Cross-validation of DCNN model
MR image to its corresponding CT image. Learning the end-
to-end mapping function requires the estimation of network The performance of the proposed DCNN method is evalu-
parameters h = {W1, b1, W2, b2, . . .}, which is achieved ated using a sixfold cross-validation procedure. The 18 sub-
through minimizing a loss or prediction error between the jects are randomly divided into six equal-sized groups. At
predicted images F ðX; hÞ and the corresponding ground truth
CT images Y. Given a set of MR images {Xi} and their *https://github.com/BVLC/caffe/wiki/Model-Zoo

Medical Physics, 44 (4), April 2017


1413 Han: Deep learning synthetic CT generation 1413

each time, one group is retained as the test set, and the the atlas-based method took about 10 min on the same
remaining five groups are used as training data to train a GPU.
DCNN model. Once the model is trained, it is applied on
each test subject’s MR image to generate the sCT. 3.B. Qualitative and quantitative evaluation of sCT
The accuracy of the predicted sCT for each subject is eval- images
uated against the real CT using a voxel-wise mean absolute
error computed within the head region: Figures 4 and 5 show cross-sectional views of two repre-
sentative sCT results (corresponding to subjects #5 and #12
1 Xn in Fig. 6) as produced by the proposed DCNN method, along
MAE ¼ jCT ðiÞ pCTðiÞj;
n i¼1
with the corresponding CT and MR images and the difference
where n is the total number of voxels inside the head region map between each sCT and the corresponding ground truth
of the MR. The head region is defined through a binary CT. Results from the atlas-based approach are also shown for
mask that is automatically derived from the MR image as comparison purposes. For most parts of the head region, the
described in Section 2.A. Similarly, the voxel-wise mean DCNN method predicts very accurate CT values as can be
error (ME) and mean squared error (MSE) can also be seen in the figures. Large errors mainly occur at interfaces
calculated: between different tissue types, especially around the borders
of bones and air. This is due to high intensity gradients at
1 XN these areas, but it may also be partially due to non-perfect
ME ¼ ðCT ðiÞ pCTðiÞÞ;
N i¼1 alignment between the MR and ground truth CT images for
and each patient, especially since a linear registration is unable to
fully account for geometric distortions commonly exist in
1 XN
MSE ¼ jCTðiÞ pCTðiÞj2 : MR scans. Large errors are also present at the sinus and mas-
N i¼1
toid process regions where the CT images show fine bony
A fourth metric, the Pearson correlation coefficient is also details whereas the MR images are rather homogeneous.
computed between the CT and sCT for voxels within the head Comparing to the DCNN method, the atlas-based results tend
region. to be noisier and have larger errors around skull and dura
For comparison purposes, we also compute sCT results regions due to atlas registration difficulties.
using an atlas-based sCT generation method that we previ- The MAE values for all 18 subjects are plotted in Fig. 6,
ously developed,55 which is also similar in principle to the which compares the DCNN method with the atlas-based
method of Andreasen et al.20 The method uses the training approach. It is seen that the DCNN method produced smaller
data as atlases and computes deformable image registration MAE than the atlas-based approach for every subject. In par-
to align each atlas MR to the test subject’s MR image. The ticular, 13 of the 18 subjects had an MAE less than 85 HU
registration transformation is applied to deform each atlas’ for the proposed DCNN method, whereas only five for the
MR/CT pair to the test subject. At every voxel of the test sub- atlas-based method.
ject’s MR, a local patch is extracted and compared against The overall statistics of the four quantitative metrics com-
nearby patches in each warped atlas MR image. The top 10 puted over the 18 test subjects are summarized in Table I for
most similar patches are found and the corresponding atlas the two different methods. Both the DCNN and the atlas-
CT numbers are then averaged using the patch similarity as based methods gave unbiased sCT estimations, as evident
weighting factors to get the predicted CT number at each test from a mean ME close to zero. Although the mean ME of the
MR image voxel. DCNN results is slightly larger than that of the atlas-based
results, the difference is not statistically significant with
P ≫ 0.05. For all the other three metrics, the proposed
3. RESULTS DCNN was found to be better than the atlas-based approach,
and the differences were also found to be statistically signifi-
3.A. Computation time
cant (P < 0.05) based on paired t-tests.
Each MR volume has about 160 slices, so a group of 15
training subjects provide 2400 training samples. With the
4. DISCUSSION
data being divided into mini-batches of size 48, it takes 50
iterations to go over all training samples once, which is A DCNN model is proposed in this work for synthetic CT
known as one epoch in neural network training. Empirically, generation using conventional, T1-weighted MR images.
we found that it requires roughly 600 epochs or 30K itera- Although a deep neural network model typically requires a
tions for the model to converge, which takes about 2.5 days large amount of training data, very good performance is
of computation time using a single NVIDIA Titan X GPU achieved with our limited data by making use of an existing,
with 3584 cores and 12 GB memory. Once the model is pretrained model through the principle of transfer learning.
trained, it takes approximately 9 s to process all 160 slices The DCNN method does not require any inter-subject image
of a new MR image to get the 3D sCT result. This leads to registration, either linear or deformable, and directly learns
very efficient sCT generation at test time. In comparison, the mapping from the space of MR images to the

Medical Physics, 44 (4), April 2017


1414 Han: Deep learning synthetic CT generation 1414

MR sCT Real CT Difference Map


FIG. 4. Qualitative comparison of sCTs and real CT for subject #5. The image type that each column represents is indicated at the bottom of the corresponding
column. First column: MR; second column: sCTs (rows 1, 3, and 5 show the DCNN results, and rows 2, 4, and 6 show the atlas-based results); third column: real
CT; fourth column: difference maps (rows 1, 3, and 5 correspond to the DCNN results, and rows 2, 4, and 6 correspond to the atlas-based results). The color bar
is associated with the difference maps. First and second rows: axial slices; third and fourth rows: coronal slices; fifth and sixth rows: sagittal slices.

corresponding CT images. Evaluation results showed that the literature. A small, local patch has limited image information
DCNN method offered significantly better accuracy than an and using raw image intensities of a patch as features may
atlas-based method with patch refinement and fusion. This suffer from large redundancy in the data and reduce the dis-
result is expected since the atlas-based method relies on patch crimination power. On the contrary, the DCNN model auto-
comparison to find similar atlas candidates, as also common matically learns a hierarchy of image features at different
in other atlas- or patch-based methods proposed in the scales and complexity from a full image slice.

Medical Physics, 44 (4), April 2017


1415 Han: Deep learning synthetic CT generation 1415

MR sCT Real CT Difference Map


FIG. 5. Qualitative comparison of sCTs and real CT for subject #12. The image type that each column represents is indicated at the bottom of the corresponding
column. First column: MR; second column: sCTs (rows 1, 3, and 5 show the DCNN results, and rows 2, 4, and 6 show the atlas-based results); third column: real
CT; fourth column: difference maps (rows 1, 3, and 5 correspond to the DCNN results, and rows 2, 4, and 6 correspond to the atlas-based results). The color bar
is associated with the difference maps. First and second rows: axial slices; third and fourth rows: coronal slices; fifth and sixth rows: sagittal slices.

The proposed method produced an overall average MAE Hofmann et al.33 reported an MAE of 100.7 HU for their
of 84.8 HU and an MSE of 188.6 when tested on data from hybrid method that combines pattern recognition and atlas
18 brain tumor patients, which compares favorably with other registration. Uh et al.31 obtained an MSE of 207 using a simi-
reported results in the literature for the head region. For lar approach as Hofmann et al.33 but with more atlases.
example, the fuzzy c-means clustering method of Su et al.9 Johansson et al.15 reported an MAE of 130 HU using a
produced an MAE of 130 HU based on UTE MR sequences. Gaussian mixture regression model. Another Gaussian

Medical Physics, 44 (4), April 2017


1416 Han: Deep learning synthetic CT generation 1416

FIG. 6. The MAEs computed within the head region for all 18 subjects, comparing two different approaches.

TABLE I. Overall statistics of four quality measures: mean absolute error (MAE), mean error (ME), mean squared error (MSE), and the Pearson correlation
coefficient (CC). Average value and standard deviation (r) are shown. The bottom row shows the p-values from paired t-tests of the differences between the two
methods.

MAE ME MSE CC

DCNN 84.8 (r = 17.3) 3.1 (r = 21.6) 188.6 (r = 33.7) 0.906 (r = 0.03)


Atlas-based 94.5 (r = 17.8) 1.5 (r = 24.2) 198.3 (r = 33.0) 0.896 (r = 0.03)
7 4
P-value 4.7e 0.54 7.0e 0.0013

mixture classification method11 obtained an MAE of where the Gaussian process model fitting itself took over an
147 HU. Gudur et al.34 achieved an MAE of 126 HU using a hour when using 12 atlases. Andreasen et al.21 reported a
single atlas and Bayesian regression. The multi-atlas method computation time of 15 h for their atlas-based method with
of Burgos et al.23 produced an MAE of 102 HU. The atlas- patch fusion. The authors later proposed an accelerated but
regression method of Sj€ olund et al. yielded an MAE of approximate patch fusion procedure20 that reduced the com-
113.4 HU. The best MAE reported in the literature was from putation time to 38 min. It was, however, acknowledged that
Andreasen et al.,21 who obtained an average MAE of 85 HU the approximation would increase the MAE by 9 HU.21 Even
but only tested on five subjects. The particular approach21 without any atlas registration and with full GPU acceleration,
consists of atlas registration followed by patch fusion, which another patch-based method of Torrado-Carvajal et al.30 still
is similar in principle to the atlas-based method that we com- took 8.86 min when using 10 atlases.
pared in this study. Of cause, these numbers are not directly A second major advantage of the DCNN method is that it
comparable since the image data used in each work were all can easily accommodate a large number of training data. In
different. fact, it is well known that deep CNN models can greatly ben-
A major advantage of the DCNN method is the fast com- efit from big data due to their high model capacity. We hence
putation time at model deployment. Although training a expect that the accuracy of the proposed DCNN method can
model can take days, the training only need be done once and further improve when more training data become available.
acceleration is possible through the use of multiple GPUs. The only burden from an increased number of training data is
Applying the model to create sCT for each new patient only longer model training time, but the size of the final model
takes a few second on a single GPU. On the other hand, and the test speed will remain the same. On the other hand, it
model-based or atlas-based methods can be rather slow. For may be cumbersome to increase training data for atlas-based
example, the hybrid method of Uh et al.31 was reported to methods or other model-based methods such as the Gaussian
take 2 h when using six atlases and 4.5 h using 12 atlases, Process model, since they require all training data be kept

Medical Physics, 44 (4), April 2017


1417 Han: Deep learning synthetic CT generation 1417

(a) (b) (c) (d)

FIG. 7. Large discrepancies exist between sCT and real CT for subject #17 around the neck area due to large anatomical changes between the real CT and the MR
image: (a) the MR image; (b) the sCT; (c) the real CT; and (d) the difference map.

around all the time. In addition, the computation time of procedure can also be designed such that three DCNN mod-
these methods is often directly proportional to the number of els can be trained, corresponding to the axial, sagittal, and
atlases used. It is also observed that the accuracy of atlas- coronal slice directions of a 3D volume, respectively. Each
based methods can quickly saturate and some atlas selection model can be applied to a new MR volume to get its sCT esti-
procedure is often needed to avoid degradation in accuracy mation. The three results can then be averaged together to get
when the number of atlases is large.20,21,31 the final sCT volume. Evaluation of these possible extensions
Similar to other atlas- or model- based methods20, accu- will also be part of future work.
rate intra-patient alignment of MR and CT images is an In this study, we only consider sCT generation from a sin-
important prerequisite and can become a bottleneck for devel- gle, T1-weighted MR image. It is also straightforward to
oping accurate DCNN models. Misalignments in training extend the model to handle multi-sequence MR images if
data cause inaccuracy in the model since the model will be they are available. Only the first layer of the model need be
trained to make wrong predictions. Inaccurately aligned modified, such that the input slices can have multiple chan-
ground truth MR and CT also negatively influence the valida- nels, with each channel representing one MR sequence. This,
tion study. We noticed that there were often some residue of cause, assumes that the multiple MR sequence images are
deformations in each MR/CT pair in our data after the initial perfectly aligned.
rigid alignment, which might be due to both uncorrected geo- Finally, note that unlike CT, MR image intensity and
metric distortions in the MR images and inaccuracy in rigid image contrast can vary significantly across field strengths
registration. When the CT and MR images are acquired days and scanner types, even if the same imaging sequence is
or weeks apart, there can also be real anatomical changes tak- used. A trained DCNN model is usually only robust to data
ing place. This was the case for subject #17, whose sCT had variations covered in the training data. Thus, a DCNN model
the largest MAE, as illustrated in Fig. 7. Correcting the resi- trained using data from one scanner may not be directly
due deformation is not a simple task since it is difficult to applicable to data from a different scanner or a different
achieve sub-voxel accuracy everywhere for inter-modality imaging sequence. To address this issue, one can apply data
deformable image registration. One potential approach to preprocessing such as the histogram-matching in Section 2.A
improve MR/CT alignment of the training data is to employ to help standardize MR image intensities before training and
the sCT generated as a substitute for the MR so as to convert applying a DCNN model. Alternatively, one can simulate
the inter-modality registration problem into a CT-to-CT regis- contrast changes as a part of data augmentation during the
tration one, for which many more registration methods are model training if a reliable contrast simulation procedure can
available. We plan to evaluate the feasibility of this approach be designed. When multi-sequence MR images are available
in future work. and if it is possible to estimate MR tissue parameters (e.g.,
As mentioned earlier, training a 3D DCNN model is cur- T1, T2, and PD) from these sequences, one can also train a
rently infeasible due to limited GPU memory. There are, how- DCNN sCT model based on MR parameter maps instead of
ever, some possible extensions to the DCNN method the raw MR images. All these topics will be part of our future
described above that may further improve the accuracy. First work.
of all, instead of using a single 2D image slice as model input
and output, it is quite straightforward to extend the model to
5. CONCLUSION
use multiple adjacent slices, which may allow the model to
capture extra image context information. For example, only In this work, we developed a deep CNN method for syn-
the first and the last layers need be modified to take three thetic CT generation from conventional, single-sequence MR
adjacent MR slices as input and output three corresponding images. Experimental results showed that the method pro-
sCT slices. When the model is applied to a new MR image, duced accurate synthetic CT results in near real time and
each non-border MR slice can have three sCT results, which competed very favorably with other state-of-the-art atlas-
can be averaged to get the final result. Such an ensemble pro- based approaches. It can also easily handle large training data
cedure may further improve accuracy. Another ensemble without sacrificing test speed. It is thus a very promising

Medical Physics, 44 (4), April 2017


1418 Han: Deep learning synthetic CT generation 1418

method that can facilitate MRI-only radiation therapy. Future 15. Johansson A, Garpebring A, Karlsson M, Asklund T, Nyholm T.
work will evaluate the method on larger data sets and other Improved quality of computed tomography substitute derived from mag-
netic resonance (MR) data by incorporation of spatial information–po-
anatomical regions. Extensions of the method to further tential application for MR-only radiotherapy and attenuation correction
improve accuracy and handle multi-sequence MR images are in positron emission tomography. Acta Oncol. 2013;52:1369–1373.
also warranted. 16. Johansson A, Karlsson M, Nyholm T. CT substitute derived from MRI
sequences with ultrashort echo time. Med Phys. 2011;38:2708–2714.
17. Navalpakkam BK, Braun H, Kuwert T, Quick HH. Magnetic resonance-
ACKNOWLEDGMENTS based attenuation correction for PET/MR hybrid imaging using continu-
ous valued attenuation maps. Invest Radiol. 2013;48:323.
The author would like to thank Dr. Jens Sj€ olund for pro- 18. Rank CM, H€unemohr N, Nagel AM, R€othke MC, J€akel O, Greilich S.
MRI-based simulation of treatment plans for ion radiotherapy in the
viding the experimental data used in this work. brain region. Radiother Oncol. 2013;109:414–418.
19. Rank CM, Tremmel C, H€unemohr N, Nagel AM, J€akel O, Greilich S.
MRI-based treatment plan simulation and adaptation for ion radiother-
CONFLICTS OF INTEREST apy using a classification-based approach. Radiat Oncol. 2013;8:51.
20. Andreasen D, Van Leemput K, Edmund JM. A patch-based pseudo-CT
The author is a full-time employee of Elekta Inc. approach for MRI-only radiotherapy in the pelvis. Med Phys.
2016;43:4742–4752.
21. Andreasen D, Van Leemput K, Hansen RH. J.a.L. Andersen, J.M.
a)
Author to whom correspondence should be addressed. Electronic mail: Edmund, “Patch-based generation of a pseudo CT from conventional
xiao.han@elekta.com. MRI sequences for MRI-only radiotherapy of the brain”. Med Phys.
2015;42:1596–1605.
22. Arabi H, Koutsouvelis N, Rouzaud M, Miralbell R, Zaidi H. Atlas-
REFERENCE guided generation of pseudo-CT images for MRI-only and hybrid PET–
MRI-guided radiotherapy treatment planning. Phys Med Biol.
1. Schmidt MA, Payne GS. Radiotherapy planning using MRI. Phys Med 2016;61:6531–6552.
Biol. 2015;60:323–361. 23. Burgos N, Cardoso MJ, Guerreiro F, et al. “Robust CT synthesis for
2. Devic S. MRI simulation for radiotherapy treatment planning. Med radiotherapy planning: application to the head and neck region”, MIC-
Phys. 2012;39:6701–6711. CAI 2015. Part II, LNCS. 2015;9350:476–484.
3. Hofmann M, Pichler B, Sch€olkopf B, Beyer T. Towards quantitative 24. Chen S, Quan H, Qin A, Yee S, Yan D. MR image-based synthetic CT
PET/MRI: a review of MR-based attenuation correction techniques. Eur for IMRT prostate treatment planning and CBCT image-guided localiza-
J Nucl Med Mol Imaging. 2009;36:93–104. tion. J Appl Clin Med Phys. 2016;17:1–10.
4. Berker Y, Franke J, Salomon A, et al. MRI-based attenuation correction 25. Dowling JA, Lambert J, Parker J, et al. An atlas-based electron density
for hybrid PET/MRI systems: a 4-class tissue segmentation technique mapping method for magnetic resonance imaging (MRI)-alone treatment
using a combined ultrashort-echo-time/Dixon MRI sequence. J Nucl planning and adaptive MRI-based prostate radiation therapy. Int J Radiat
Med. 2012;53:796–804. Oncol Biol Phys. 2012;83:5–11.
5. Catana C, van der Kouwe A, Benner T, et al. Toward implementing an 26. Dowling JA, Sun J, Pichler P, et al. Automatic substitute CT generation
MRI-based PET attenuation-correction method for neurologic studies and contouring for MRI-alone external beam radiation therapy from
on the MR-PET brain prototype. J Nucl Med. 2010;51:1431–1438. standard MRI sequences. Int J Radiat Oncol Biol Phys. 2015;93:1144–
6. Keereman V, Fierens Y, Broux T, De Deene Y, Lonneux M, Vanden- 1153.
berghe S. MRI-based attenuation correction for PET/MRI using ultra- 27. Schreibmann E. J.a. Nye, D.M. Schuster, D.R. Martin, J. Votaw, T. Fox,
short echo time sequences. J Nucl Med. 2010;51:812–818. “MR-based attenuation correction for hybrid PET-MR brain imaging
7. Martinez-M€ oller A, Souvatzoglou M, Delso G, et al. Tissue classifi- systems using deformable image registration”. Med Phys. 2010;37:2101–
cation as a potential approach for attenuation correction in whole- 2109.
body PET/MRI: evaluation with PET/CT data. J Nucl Med. 28. Siversson C, Nordstr€om F, Nilsson T, et al. Technical Note: MRI only
2009;50:520–526. prostate radiotherapy planning using the statistical decomposition algo-
8. Hsu S-H, Cao Y, Huang K, Feng M, Balter JM. Investigation of a rithm. Med Phys. 2015;42:6090–6097.
method for generating synthetic CT models from MRI scans of the head 29. Sj€olund J, Forsberg D, Andersson M, Knutsson H. Generating patient
and neck for radiation therapy. Phys Med Biol. 2013;58:8419–8435. specific pseudo-CT of the head from MR using atlas-based regression.
9. Su K-H, Hu L, Stehning C, et al. Generation of brain pseudo-CTs using Phys Med Biol. 2015;60:825–839.
an undersampled, single-acquisition UTE-mDixon pulse sequence and 30. Torrado-Carvajal A, Herraiz JL, Alcain E, et al. Fast patch-based
unsupervised clustering. Med Phys. 2015;42:4974–4986. pseudo-CT synthesis from T1-weighted MR images for PET/MR attenu-
10. Zaidi H, Montandon M-L, Slosman DO. Magnetic resonance imaging- ation correction in brain studies. J Nucl Med. 2015;57:136–144.
guided attenuation and scatter corrections in three-dimensional brain 31. Uh J, Merchant TE, Li Y, Li X, Hua C. MRI-based treatment planning
positron emission tomography. Med Phys. 2003;30:937–948. with pseudo CT generated through atlas registration. Med Phys.
11. Zheng W, Kim JP, Kadbi M, Movsas B, Chetty IJ, Glide-Hurst CK. 2014;41:051711.
Magnetic Resonance-Based Automatic Air Segmentation for Generation 32. Chen Y, Rubin BG, Lee YZ, Shen D. Probabilistic air segmentation and
of Synthetic Computed Tomography Scans in the Head Region. Int J sparse regression estimated pseudo CT for PET/MR attenuation correc-
Radiat Oncol Biol Phys. 2015;93:497–506. tion. Radiology. 2015;275:562–569.
12. Edmund JM, Kjer HM, Van Leemput K, Hansen RH, Andersen JA, 33. Hofmann M, Steinke F, Scheel V, et al. MRI-based attenuation correc-
Andreasen D. A voxel-based investigation for MRI-only radiotherapy of tion for PET/MRI: a novel approach combining pattern recognition and
the brain using ultra short echo times. Phys Med Biol. 2014;59:7501– atlas registration. J Nucl Med. 2008;49:1875–1883.
7519. 34. Gudur MSR, Hara W, Le Q-T, Wang L, Xing L, Li R. A unifying proba-
13. Kapanen M, Tenhunen M. T1/T2*-weighted MRI provides clinically rel- bilistic Bayesian approach to derive electron density from MRI for radia-
evant pseudo-CT density data for the pelvic bones in MRI-only based tion therapy treatment planning. Phys Med Biol. 2014;59:6595–6606.
radiotherapy treatment planning. Acta Oncol. 2013;52:612–618. 35. Lecun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521:436–
14. Korhonen J, Kapanen M, Keyril€ainen J, Sepp€al€a T, Tenhunen M. A dual 444.
model HU conversion from MRI intensity values within and outside of 36. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with
bone segment for MRI-based radiotherapy treatment planning of prostate deep convolutional neural networks. Advances In Neural Information
cancer. Med Phys. 2014;41:011704. Processing Systems. 2012;25:1097–1105.

Medical Physics, 44 (4), April 2017


1419 Han: Deep learning synthetic CT generation 1419

37. LeCun Y. L.o. Bottou, Y. Bengio, P. Haffner, “Gradient-based learning 46. Chen L-C, Papandreou G, Kokkinos I, Murphy K, Yuille AL. “Semantic
applied to document recognition”. Proc IEEE. 1998;86:2278–2323. image segmentation with deep convolutional nets and fully connected
38. Zhang W, Li R, Deng H, et al. Deep convolutional neural networks for CRFs”, arXiv. preprint. 2014;1412:1–14.
multi-modality isointense infant brain image segmentation. NeuroImage. 47. Noh H, Hong S, Han B. Learning deconvolution network for
2015;108:214–224. semantic segmentation. Proc. Int. Conf. Comp. Vis. 2015:1520–1528.
39. Ronneberger O, Fischer P, Brox T. “U-Net: convolutional networks for 48. Badrinarayanan V, Kendall A, Cipolla R. “SegNet: a deep convolutional
biomedical image segmentation”, MICCAI 2015. Part III, LNCS. encoder-decoder architecture for image segmentation”, arXiv. preprint.
2015;9351:234–241. 2015;1511:1–14.
40. Roth H, Lu L, Liu J, et al. Improving computer-aided detection using 49. Simonyan K, Zisserman A. “Very deep convolutional networks for
convolutional neural networks and random view aggregation. IEEE large-scale image recognition”, arXiv. preprint. 2014;1409:1–14.
Trans Med Imaging. 2016;35:1170–1181. 50. Zeiler MD, Fergus R. “Visualizing and understanding convolutional net-
41. Kooi T, Litjens G, van Ginneken B, et al. Large scale deep learning for works”, ECCV 2014. LNCS. 2014;8689:818–833.
computer aided detection of mammographic lesions. Med Image Anal. 51. Jia Y, Shelhamer E, Donahue J, et al. Caffe: convolutional architecture
2017;35:303–312. for fast feature embedding. Proc. ACM Int. Conf. Multimedia. 2014:
42. Van Nguyen H, Zhou K, Vemulapalli R. “Cross-domain synthesis of 675–678.
medical images using efficient location-sensitive deep network”, MIC- 52. LeCun YA, Bottou L, Orr GB, M€uller KR. Efficient backprop. Neural
CAI 2015. Part I, LNCS. 2015;9349:677–684. Networks: tricks of the trade Springer. 1998;1524:9–48.
43. Sled JG, Zijdenbos AP, Evans AC. A nonparametric method for auto- 53. Kingma D, Ba J. Adam: a method for stochastic optimization. Proc. Int.
matic correction of intensity nonuniformity in MRI data. IEEE Trans Conf. Learning Representations. 2014:1–13.
Med Imaging. 1998;17:87–97. 54. Ioffe S, Szegedy C. Batch normalization: accelerating deep network
44. Cox IJ, Roy S, Hingorani SL. Dynamic histogram warping of image training by reducing internal covariate shift. arXiv preprint 1502.03167,
pairs for constant image brightness. Proc. Int. Conf. Image Proc. 1-11; 2015.
1995;2:366–369. 55. Han X. An efficient atlas-based synthetic CT generation method. Med
45. Otsu N. A threshold selection method from gray-level histograms. IEEE Phys. 2016;43:3733.
Trans. Syst. Man Cybern. 1979;9:62–66.

Medical Physics, 44 (4), April 2017