You are on page 1of 15

A Quality Control Algorithm for Filtering SNPs in GWAS

Monnat Pongpanich (mpongpa@ncsu.edu) December 7, 2009

List of acronyms
MAF: minor allele frequency MSP: missing proportion HWE: Hardy Weinberg Equilibrium cs: affected individuals cn: unaffected individuals all: combined affected and unaffected individuals

Background
This program filters SNPs for quality control purpose based on SNPs QC variables value. It requires one input dataset and will give a list of filtered SNP as the final output. Input: The input dataset has N rows by 8 columns. Each row represents SNP and the 8 columns are 1) SNP rsid 2) MSPcs 3) MSPcn 4) MSPall 5) MAFcs 6) MAFcn 7) MAFall 8) HWE P value respectively. The header for the input dataset is required and the column must be in the order mentioned above. Each column is separated by space. An example of input file is given in input folder. Output: The output has one column: filtered SNP rsid.

Running
This program can be run in two modes: step by step and batch mode. To run the program type ./main Rbindir Rsourcedir Cbindir Rbindir is the directory of R bin folder Rsourcedir is the directory of Rsource folder of this program Cbindir is the directory of Cbinary folder of this program For example, at the program installed path,

Type A if you want to run in step by step mode. Type B if you want to run in batch mode. A. Step by step mode has 11 steps

Step 1: Type 1 to run this step. This step calculates log(MAF pool), MSPcs/MAFcs, and MSPcn/MAFcn. Zero MAF will be replaced with minimum of MAFall besides 0. Input parameters: 1) Full path of input file 2) Directory that will keep the output files from this step

Output: 1) preinitial.txt file has 3 extra calculated columns from the input file. Step2: Type 2 to run this step. This step will pre-clean SNPs that has HWE P value less than a specified cutoff value.

Input parameters: 1) Full path of input file 2) Directory that keeps the output files from step 1 3) Directory that will keep the output files from this step 4) Cutoff for HWE P value

Output: 1) freqall.txt has one columns: MAFall which zero value has not been replace. 2) freqnozero.txt has three columns: MAFcs, MAFcn, MAFall which zero value has been replaced. 3) initial.txt has eight columns: id, rsid, MSPcs, MSPcn, MSPall, log(freqall), MSPcs/MAFcs, MSPcn/MAFcn

Step 3: Type 3 to run this step. This step will apply PCA on 6 variables: MSPcs, MSPcn, MSPall, log(MAFall), MSPcs/MAFcs, MSPcn/MAFcn Input parameter: 1) Directory that keeps the output files from step 2 2) Directory that will keep the output files from this step

Output: 1) pca.txt has three columns: id, pc1, pc2 2) importance.txt gives the standard deviations of the principal components 3) rotation.txt gives the matrix of variable loadings

Step 4: Type 4 to run this step. This step calculates k-th nearest neighbour distance for each SNP. Input parameters: 1) Directory that keeps the output files from step 3 2) Directory that will keep the output files from this step 3) Minimum number of points in the neighbourhood (MinPts; four is recommended)

Output: 1) fourdist.txt has one column: k-th nearest neighbour distance

Step 5: Type 5 to run this step. This step generates k-nearest neighbour graph used for initial guess of changing point step.

Input parameter: There are two r values that can be calculated: suggested r and upper r. Suggested r is the value we recommend to use in DBSCAN and upper r is the upper limit of r. User can fine-tune r value but we recommend not to go beyond upper r value. 1) Directory that keeps the output files from step 4 for finding suggested r or directory that keeps the output file from step 8 for finding upper r 2) Directory that will keep the output files from this step 3) Enter 1 if you want to calculate suggested r in step 6 Enter 2 if you want to calculate upper r in step 6

Output: 1) k-th nearest neighbour graph (fourdist.tif for suggested r and fourdistupper.tif for upper r).

Step 6: Type 6 to run this step. This step will calculate r and change point. Input parameters: 1) Directory that keeps the output files from step 4 for finding suggested r or directory that keeps the output file from step 8 for finding upper r 2) Directory that will keep the output files from this step 3) In some cases, there are two curves in the graph. We only want the change point in the second curve; therefore, we exclude points around the first curve region. For example, in the figure below, the starting point is 270,000

4) In some cases, there are outlier points at the tip. We exclude them so that it will not influence the change point answer. For example, in the following figure, the end SNP is 472,000 (with the staring SNP is 270,000)

The following parameters are initial guess

Slope of second line

Slope of first line

y-intercept

Change point

5) Initial guess for y-intercept 6) Slope of the first line

7) Slope of the second line 8) Change point 9) sd 10) Enter 1 if you want to calculate suggested r Enter 2 if you want to calculate upper r

Output: 1) epssuggested.txt (for suggested eps) or epsupperbound.txt (for upper r) file that contain value of yintercept (alpha), slope of line 1 (beta1), slope of line 2 (beta2), change point (cp), sd (sd), r value (eps), the value of function to be minimized (objective), an integer code: 0 indicates successful convergence (convergence), number of iteration performed (iterations).

User must make sure that it has successful convergence (0) before continuing the next step. If the
program does not converge, change the initial guess.

Step 7: Type 7 to run this step. This step run DBSCAN Input arguments: 1) Directory that keeps the output files from step 3 2) Directory that will keep the output files from this step 3) r value (answer from step 6) 4) Minimum number of points in the neighbourhood (four is recommended)

Output: 1) dbscan.txt has two columns: snpid and cluster id

Step 8: Type 8 to run this step. This step will prepare the file needed for calculating upper r. Then to find upper r, user has to re-run step 5 and 6 but enter 2 for the last parameter. Note: If user does not want to find upper r. This step can be omitted. Input parameters: 1) Directory that keeps the output files from step 6

2) Directory that keeps the output files from step 4 3) Directory that will keep the output files from this step

Output: 1) fourdistupper.txt has one column: k-th nearest neighbour distance of point that is in the right of change point in step 6

Step 9: Type 9 to run this step. This step will prepare the file needed for monotonicity step. Input parameters: 1) Directory that keeps the output files from step 7 2) Directory that keeps the output files from step 2 3) Directory that will keep the output files from this step 4) Number of people for calculating epsilon in cases. Epsilon is calculated from #of people/# of cases. For example, if user enter 1 and there are 1944 cases, the epsilon for cases is 1/1944 5) Number of people for calculating epsilon in controls. Epsilon is calculated from #of people/# of controls. For example, if user enter 1 and there are 3004 controls, the epsilon for control is 1/3004 6) Number of people for calculating epsilon in pool. Epsilon is calculated from #of people/( # of cases + # of controls). For example, if user enter 1 and there are 1944 cases, 3004 controls, the epsilon for pool is 1/(1944+3004)

We recommend enter 1 for all epsilon 7) Number of cases in the study 8) Number of controls in the study

Output: 1) largestcluster.txt has 10 columns: id, rsid, MSPcs, MSPcn, MSPall, MAFcs, MAFcn, MAFall, MSPcs/MAFcs, MSPcn/MAFcn. It contains only SNPs that are in the largest cluster from DBSCAN 2) othercluster.txt has 10 columns: same as largestcluster.txt but contains SNPs which are not kept in largestcluster.txt

Step 10: Type 10 to run this step. This step finds which SNP should also be kept for monotonicity. Input parameters: 1) Directory that keeps the output files from step 9 2) Directory that will keep the output files from this step

Output: 1) bringback.txt has one column: id of snp that will also be kept

Step 11: Type 11 to run this step. This step finalize the answer: filtered SNPs Input parameters: 1) Directory that keeps the output files from step 7 2) Directory that keeps the output files from step 10 3) Directory that will keep the output files from this step

Output: 1) filteredSNPs.txt has one column: rsid of kept SNPs Type 0 to go back to the main menu

B. Batch mode Batch mode will ask for full path of input file, the directory that will keep all files generated from this program. It will then stop at a certain point to ask for required parameters needed at that moment, e.g., HWE P value, number of neighbour (MinPts). The explanation for these parameters is in step by step section. It will automatically exit the program when it is done. Note: please make sure before entering r value that the program converges.

You might also like