May
09

Alzheimer’s Books: Tangles and The Little Girl in the Radiator

I read two books on Alzheimer’s from the caretaker’s perspective:

  1. The Little Girl in the Radiator
  2. Tangles

Tangles is written by Sarah Leavitt about her mother; the main caretaker is her father.

In The Little Girl in the Radiator, the son (Martin Slevin) is the main caretaker.

I enjoyed Radiator Girl more because of its humor. Tangles is a comic book and I’m not a big fan of comic books. Both were good; tears were flowing at the end.

Here are some things that stood out for me:

Loss of Appetite
In both books, the patient loses her appetite and so is in real danger of growing weaker. This may be in part because the patient’s sense of smell is gone, so flavor and taste come only from the tongue. As a result, both patients loved sweet food (sweet candies in Tangles and chocolate biscuits in The Little Girl in the Radiator). It’s bittersweet to see the lengths to which the patients would go for sweet food. In Tangles, the mother eats candies without taking the wrapper off, while in The Little Girl in the Radiator, the mother buys 50 packets of chocolate biscuits.

Hygiene
Managing the bodily functions of an Alzheimer’s patient is hard. In Tangles, the mother’s hygiene has deteriorated to the point where bits of feces are left around and she is oblivious. The author mentions multiple incidents, and I think the mother even pooed in her underwear. I’m not sure why they didn’t put her in diapers earlier.

There are some nice pictures of a raised toilet seat, which only a comic book could capture so well.

I wonder if a bidet toilet would have helped the patient stay clean, and whether there are bidet toilets for the elderly. You’d have to be used to using one too, and I know I’m not.

Also in Tangles, keeping the mother clean is a challenge. She can’t brush her own teeth, so her breath stinks. The comic pictures a bathtub, which must be difficult to get in and out of; I wonder if a shower stall is easier for someone with Alzheimer’s. Hygiene is such a challenge that they hire two caretakers for the mother when the father is away at work or needs a break.

Pets offer love
In both books, pets seem to help. In Tangles, Sarah’s mother recognizes the cat even when she no longer recognizes her own daughter. In The Little Girl in the Radiator, the patient adopts an unruly dog for a few months and is the one person who can relate to it. There are some pretty touching stories about the dog stealing the Sunday roast, but then finding the lost patient!

Lost concept of time / Waking up at night
In both books, the patient can’t sleep and wakes up in the middle of the night, expecting a hair appointment or having a full-on conversation. In Tangles, the husband turns on the TV all night to entertain his affected wife. This can be very hard on the caretaker because the caretaker needs rest too.

Interesting Observations by the Author of The Little Girl in the Radiator:
Suggestible and Easily Duped
This may depend on how social the patient is. In the Little Girl in the Radiator, the author describes how his mother is easily swindled by door-to-door salesmen. Thankfully, he had power of attorney and was able to dispute charges which the mother had agreed to.
Know the laws in your state and country and assert them! Get power of attorney so that any legal agreement that the patient enters into isn’t binding.
Another example of the patient’s impressionability: whatever she watches on television becomes reality for her.

Repetitive actions that speak to a larger insecurity
Martin Slevin was astute enough to interpret deeper meaning from his mother’s obsessions. The title “Little Girl in the Radiator” is a fixation that his mother has — that there’s a little girl trapped in the radiator. Martin learns from his mother in her last days that ‘she’ is the little girl in the radiator — that she feels trapped and can’t come out. There are other repetitive patterns such as asking to go to the hairdresser (the mother was always looking her best and well-coiffed) and locking her son out of the house when she felt insecure or that he was a threat.

Apr
16

ApoE4 and increased risk of Alzheimer’s

My significant other (SO) has a good chance of getting Alzheimer’s, based on his family history and genetics (he’s ApoE4 homozygous). We discovered this when he got his DNA tested.

So my goal (that sounds really ambitious – maybe I should say “aim”? Trying again…)

My aim is to find preventive measures for Alzheimer’s, specifically for ApoE4 carriers. My partner is 42 years old right now. Can we prevent Alzheimer’s or delay it? If we start early, maybe he can have another 5 years of sanity. I know taking care of him will be exhausting. I’m trying to spend some time now to see if I can delay/prevent his dementia.

Yes, drugs are in development. Even if a drug gets to the market by the time he needs it, it could have side effects, be too pricey, or may not work effectively on him.

So if there are natural, preventive measures, why not give them a try? I’m going to read the research and, using my biology/genetics/disease background, see if it makes sense at a biology/molecular level. Maybe we can do an experiment!

Let’s start at the beginning — how did we find out that my SO is at risk for Alzheimer’s?

The regulations are always changing on whether a person has the right to know if they’re at risk for Alzheimer’s. In 2017, the FDA permitted 23andMe to provide Alzheimer’s predictions [1, 2].

My SO had been tested prior to this approval. We were still able to get his Alzheimer’s risk (and others) by:

  1. Getting the raw 23andMe data.
    (DNA never changes, so the information is always there; regulation was just keeping customers from the interpretation.)
  2. Getting the interpretations from Promethease, a 3rd-party service, for $5.
  3. Reading the APOE4 genotype from the Promethease report.

Early 23andMe customers got Alzheimer’s predictions (2007-2013), but he ordered his test during the FDA ban period.

If you want to save money and not pay for 23andMe’s $199 ancestry+health offering, you could pay for 23andMe’s $99 ancestry service, and then pay $5 to Promethease for the disease risk predictions. The diseases in the 23andMe report are well-studied, so whether you use Promethease or 23andMe, the results should be the same. (No guarantees, as we haven’t paid for the $199 service, so I can’t check it.)

However, it’s not easy to read Promethease’s report. The easiest way is to look at the interactive report (report_ui2.html) and see how many copies of APOE ε4 he has. ε4 is bad, while ε3 and ε2 are OK. Unfortunately, my SO has 2 copies of APOE ε4.

Honestly, unless you have a strong background in genetics or Alzheimer’s, I’d pay the extra money to get the 23andMe report, because this is the type of stuff you don’t want to misinterpret. (I saw the early reports from 2007-2013, and they’re much clearer than Promethease.)

Still, looking specifically at Alzheimer’s and Promethease’s $5 report, I could tell that Promethease was looking at the right genetic mutations. Because he has two copies of the ‘bad’ allele, his risk for Alzheimer’s disease is increased 12x. He also has a family history of Alzheimer’s, so unfortunately, I think it’s only a matter of time.

Here are the resources I’ve found so far:

  1. ALZFORUM has a lot of academic papers on ApoE4’s role in Alzheimer’s, which is great!
  2. Apoe4.info – People who have ApoE4 come to share on this forum. I especially like “Our Stories”, where people share what diet and exercise regimen they’re trying.

    Community Biomarker Archive is where people share their cholesterol, weight, and other body markers. I wish there were some test results that measure memory/cognitive decline, so people could know if what they’re doing/eating works.

Do you know any other resources out there for ApoE4?

Jan
03

FDA Validation of a PCR Test: Precision (Repeatability & Reproducibility)

Precision measures the consistency of results. Precision is not accuracy. While accuracy is getting the right answer, precision is getting the same answer over and over again. You don’t want one answer one day and, because it’s raining or the lab technician skipped their cup of coffee, a different answer the next!

A test needs to be accurate and precise. Accuracy was addressed in another post; this post will assess precision.

To assess precision, the lab runs the same sample under different conditions — different operators, machines, days, batches, etc. We’re trying to see that there are no significant differences; otherwise we need to address them. For example, if there’s a difference between machines, this suggests the machines need to be re-calibrated.
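The idea can be made concrete with a toy simulation (made-up data, not the validation data): replicate ΔCt measurements from two operators, followed by an ANOVA to ask whether the operator adds variance.

```r
# Toy sketch (simulated data): two operators measure the same three
# concentrations; does "Operator" contribute variance beyond noise?
set.seed(42)
sim <- expand.grid(Sample   = c("LoD", "2xLoD", "4xLoD"),
                   Operator = c("Op1", "Op2"),
                   Rep      = 1:8)

# true deltaCt depends on concentration only; operators add nothing
true_dCt <- c(LoD = 5.0, "2xLoD" = 4.0, "4xLoD" = 3.0)
sim$deltaCq <- true_dCt[as.character(sim$Sample)] + rnorm(nrow(sim), sd = 0.2)

fit <- aov(deltaCq ~ factor(Sample) * factor(Operator), data = sim)
summary(fit)  # Sample term should dominate; Operator should contribute little
```

In a real validation, a significant Operator (or Instrument, or Batch) term is the signal that something needs to be investigated.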

ΔCt from two batches. Batch 1 and 2 look equivalent, as seen by the red boxplot (Batch 1) pairing nicely with a blue boxplot (Batch 2) for a given concentration.

x-axis labels describe the different concentrations below, at, or near the limit of detection (LoD). FFPE indicates formalin-fixed, paraffin-embedded samples.

ΔCt between the two instruments. Instrument 2 gives slightly higher values than Instrument 1; is this significant? The numbers will tell…

ΔCt between the two operators. The boxplots from the 2 different operators are similar  — nice job, guys!


ΔCt measures across the different runs. It’s a little noisy — ANOVA will tell us whether this is significant.

ΔCt measures across the different wells. For the most part, the boxplots at the different concentrations are grouped together.

While a picture is worth a thousand words, are these differences significant? To explore this, we’ll use ANOVA.

ANOVA stands for “analysis of variance” and is used to analyze the differences among groups. We expect the sample concentrations to contribute to differences in ΔCt. We hope that other factors like instrument and batch do not. If they do, something is wrong.

Here are some example ANOVA results.

Effect                      | Variance | Precision | Lower 95% CI | Upper 95% CI | p-value
Sample                      | 34.29    | 5.86      | 11           | 47           | 6.723 * 10^-141
Operator                    | 0.02     | 0.14      | 0            | 21.13        | 0.50
Sample-Operator Interaction | 0.01     | 0.1       | 0            | 0.09         | 0.94

It’s expected that sample concentrations contribute a lot to the variance (p = 6.7 * 10^-141). The variance that the operators contribute is small (0.02) and not significant (p = 0.50). There is no interaction between the sample concentration and the operator (p = 0.94).

Effect                 | Variance | Precision | Lower 95% CI | Upper 95% CI | p-value
Sample                 | 3.89     | 1.97      | 1.255        | 4.07         | 1.2 * 10^-43
Run                    | 0.09     | 0.3       | 0.05         | 0.19         | 0.001
Sample-Run Interaction | 0.05     | 0.22      | 0.03         | 0.07         | 0.14

The variance that the run contributes is small (0.09) but significant (p = 0.001).

Here is the code that generates the statistics:

First, label the data by the different effects.
In the data frame, we have a column labelled “Run” that takes on the values “Run1” and “Run2”. There is another column labelled “Well” that contains the well values.


effects <- c("Run", "Well", "Batch", "Operator")

Loop through each effect (variable) and run an ANOVA for it.


library (car)  # provides Anova() with Type III sums of squares

for (effect in effects) {
  anova_results <- Anova (aov (deltaCq ~ factor (Sample) * factor (eval (as.name (effect))),
    data=cleaned_channels), type = "III", singular.ok=TRUE)
  anova_stats <- anova_stat_function (anova_results)
  print (anova_stats)
}

The anova_stat_function calculates the metrics seen in the tables:


anova_stat_function <- function (anova_results) {
   anova_df <- as.data.frame(matrix(nrow = nrow (anova_results) ))
   row.names (anova_df) <- row.names (anova_results)

   sum_of_squares_column = 1
   degrees_of_freedom_column = 2
	
   anova_df["Variance"] = anova_results[sum_of_squares_column]/anova_results[degrees_of_freedom_column]
   anova_df["Degrees of Freedom"] <- anova_results[degrees_of_freedom_column]
   anova_df["Precision"] = sqrt (anova_df["Variance"])
   # confidence interval is sum of squares divided by chi square
   chisq97.5 <- qchisq (0.975, anova_results[[degrees_of_freedom_column]])
   chisq2.5 <- qchisq (0.025, anova_results[[degrees_of_freedom_column]])
   anova_df["Lower 95% CI"] = anova_results[sum_of_squares_column]/chisq97.5
   anova_df["Upper 95% CI"] = anova_results[sum_of_squares_column]/chisq2.5

   return (anova_df)
}

Complete code can be found on GitHub under Rscripts_repeatability.txt and Rscripts_reproducibility.txt.

Dec
21

FDA Validation of a PCR Test: Reportable Range

The Reportable Range plan validates the test under many different conditions: 3 RNA input levels (low, medium, high) crossed with the target at 12 different concentrations, ranging from 0.0488% to 100%. By testing all 36 combinations (12 * 3), we’ll get:

  1. Summary statistics for each fusion concentration and RNA input level
  2. Amplification efficiencies at the different % fusion concentrations
  3. Relationships between:
    1. Texas Red Ct and RNA Input Level (should be positively correlated)
    2. FAM and Texas Red Ct Values (should be positively correlated)
    3. ∆Ct and Texas Red Ct Values (should not be correlated)
    4. FAM Ct and % Fusion Concentration (should be negatively correlated)
    5. TxRd Ct and % Fusion Concentration (should not be correlated)
    6. ∆Ct Value and % Fusion Concentration (should be negatively correlated)
  4. Multiple Linear Regression

Many relationships between x and y are being tested; this is to make sure the assay behaves as expected.

1. Summary statistics for each fusion concentration and RNA input level
For the summary statistics, we’ll just use the summary_function as described in a previous post. The summary_function prints out the common metrics like average, median, std dev, percentiles, etc.

2. Amplification efficiencies at the different % fusion concentrations

The amplification efficiency for the reaction is calculated using the formula:
Amplification Efficiency = (10 ^ (-1 / slope)) -1, (^ denotes “to the power of”).

This is our R function (because we are going to pass in different linear models based on RNA input or fusion concentration):

calc_amplification_efficiency <- function (lm) {
  slope <- lm$coefficients[2]
  amplification_efficiency <- (10 ^ (-1 / slope)) - 1
  return (amplification_efficiency)
}

all.lm <- lm (FAM.Cq ~ log10(Fusion_Concentration), data=channels)
print (calc_amplification_efficiency (all.lm))

Ideally your amplification efficiency is close to 1.
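As a quick sanity check on the formula (a standalone copy of the function above, fed made-up slopes): the textbook-perfect qPCR standard-curve slope of about -3.32 corresponds to an efficiency near 1, i.e. the product roughly doubles each cycle.

```r
# Standalone copy of the efficiency formula, for a numeric sanity check
calc_amplification_efficiency_demo <- function (slope) {
  (10 ^ (-1 / slope)) - 1
}

calc_amplification_efficiency_demo(-3.32)  # close to 1 (perfect doubling)
calc_amplification_efficiency_demo(-3.6)   # around 0.9 (90% efficient)
```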

3. Relationships Between Two Variables
In this section, we fit a lot of linear regressions to check that the data behaves as expected. We generate a lot of plots, but it's really the slope of the linear regression that tells us whether the data behaves as expected.

Since we're looking at so many variables, a for-loop was necessary. This is a for-loop that looks at each concentration across the different measures (FAM, ΔCt, etc).

params <- c("deltaCq", "FAM.Cq", "Target_PTPRK_Ct")
conc <- c(100, 50, 25, 12.5, 6.25, 3.125, 1.563, 0.7810, 0.3906, 0.1953, 0.0977, 0.0488)

for (param in params) {
  for (fus_conc in conc) {
    subset_df <- channels[channels$Fusion_Concentration == fus_conc, ]
    subset_df.lm <- lm (eval (parse (text=param)) ~ TexasRed.Cq, data=subset_df)
    confidence_interval <- predict (subset_df.lm, interval='confidence', level=0.95)
    print (c ("% fusion", fus_conc))
    print (coef (subset_df.lm))
    print (confint (subset_df.lm, 'TexasRed.Cq', level=0.95))
  }
}

Here are the linear regressions between FAM and Texas Red at various concentrations:

% Concentration | Intercept | Slope | 95% Confidence Interval around Slope
100             | 1.49      | 0.94  | (0.91, 0.98)
50              | 1.53      | 0.99  | (0.95, 1.01)
25              | 2.22      | 1     | (0.93, 1.06)
12.5            | 3.59      | 0.99  | (0.92, 1.04)
6.25            | 5.33      | 0.95  | (0.89, 1.01)
3.125           | 4.04      | 1.06  | (0.97, 1.14)
1.5625          | 6.93      | 0.99  | (0.85, 1.12)
0.78125         | 3.96      | 1.17  | (0.96, 1.39)
0.3906          | 9.46      | 1.02  | (0.75, 1.30)
0.1953          | 18.67     | 0.72  | (0.16, 1.27)
0.0977          | 10.66     | 1.15  | (-0.52, 2.83)
0.0488          | 51.82     | -0.48 | (-25.19, 24.23)

The corresponding figure is below:

FAM Ct and Texas Red Ct Values

At higher concentrations, the slope is close to 1, indicating a strong linear relationship, but the slope deviates from 1 at lower concentrations.

Another for-loop in the code generates a whole lot of graphs...

Texas Red Ct and RNA Input Level

∆Ct and % Fusion Concentration

FAM Ct and % Fusion Concentration

Texas Red Ct and % Fusion Concentration

4. Multiple Linear Regression
This multiple linear regression has Texas Red and the fusion concentration as independent (predictor) variables, with ΔCt as the dependent variable.

mult.lin.regression <- lm (formula = deltaCq ~ TexasRed.Cq + log2(Fusion_Concentration) , data=channels)
summary (mult.lin.regression)
confint(mult.lin.regression, level=0.95)

∆Ct should not depend on the RNA input level (Texas Red Cq), so the TexasRed.Cq coefficient should be approximately 0, indicating little relationship between ∆Ct and RNA input level. If the coefficient for log2(% fusion) is large, this means % fusion concentration can predict ΔCt.

Dec
11

FDA Validation of a PCR Test: Run Control Specification (Part 5)

The purpose of the run control specification is to find the acceptable range of values for accepting a sample. For example, if a Texas Red value is detected in cycle 38 (Ct 38), this is too late in the PCR cycles and the signal could be due to contamination. Thus the sample is thrown out.

To get the acceptable ranges, the PCR test is run on clinical positive controls (FFPE and fresh frozen). Because Texas Red serves as an internal control, we look at its values to decide whether we accept/reject a sample.

The objective of the Run Control Specification is to find a range of Texas Red values that will deem a sample to be acceptable.

The code can be found on github.

We want to find the acceptable Texas Red range for the clinical samples. We analyze the positive controls and get prediction intervals based on the observed values.

The results of the Texas Red channel at 0.9, 0.95 and 0.99 confidence and 0.9, 0.95 and 0.99 coverage look like:

Sample           | Fluor     | Confidence | Coverage Level | 2-sided Lower Prediction Interval | 2-sided Upper Prediction Interval
Positive Control | Texas Red | 0.99       | 0.9            | 29.96 | 31.38
Positive Control | Texas Red | 0.99       | 0.95           | 29.96 | 31.38
Positive Control | Texas Red | 0.99       | 0.99           | 29.96 | 31.38
Positive Control | Texas Red | 0.95       | 0.9            | 29.96 | 31.38
Positive Control | Texas Red | 0.95       | 0.95           | 29.96 | 31.38
Positive Control | Texas Red | 0.95       | 0.99           | 29.96 | 31.38
Positive Control | Texas Red | 0.9        | 0.9            | 30.04 | 31.35
Positive Control | Texas Red | 0.9        | 0.95           | 29.96 | 31.38
Positive Control | Texas Red | 0.9        | 0.99           | 29.96 | 31.38

From these results we choose the broadest range (29.96-31.38) because we want to be liberal in accepting samples. For this range, any confidence and coverage level will do except 0.9 confidence with 0.9 coverage.
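Applying the chosen range is then a one-liner; a minimal sketch with made-up sample values (the data frame and its columns here are hypothetical):

```r
# Hypothetical sketch: flag samples whose Texas Red Cq falls outside
# the chosen run-control range (29.96 - 31.38)
lower_limit <- 29.96
upper_limit <- 31.38

sample_TexRed <- data.frame(SampleID = c("S1", "S2", "S3"),
                            Cq       = c(30.5, 29.2, 38.0))
sample_TexRed$Accepted <- sample_TexRed$Cq >= lower_limit &
                          sample_TexRed$Cq <= upper_limit
sample_TexRed  # S1 is accepted; S2 and S3 are rejected
```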

How to get the Prediction Intervals

We make a data frame of the positive controls only (not the negative controls):

positive_control = df[df$Content == 'Pos Ctrl-1' | df$Content == 'Pos Ctrl-2' | df$Content == 'Pos Ctrl-3' | df$Content == 'Pos Ctrl-4', ]

And then a data frame for each channel Texas Red, Cy5, and FAM:
pc_TexRed = positive_control[positive_control$Fluor == 'Texas Red',]
pc_Cy5 = positive_control[positive_control$Fluor == 'Cy5', ]
pc_Fam = positive_control[positive_control$Fluor == 'FAM', ]

In the next section, I’ll show the analysis on Texas Red but this can be applied to all channels.

  1. Outlier removal by the Tukey rule on quartiles +/- 1.5 IQR:

    pc_TexRed_noOutliers <- outlierKD (pc_TexRed, Cq)

  2. Test if the data looks normal or not:

    shapiro.test (pc_TexRed_noOutliers$Cq)

    The results give us:

    Results of Hypothesis Test
    --------------------------

    Test Name: Shapiro-Wilk normality test

    Data: pc_TexRed_noOutliers$Cq

    Test Statistic: W = 0.9494407

    P-value: 0.001008959

    which means the data is not normal.

  3. The tolerance package calculates prediction intervals.

    Load the library:


    library (tolerance)

    If the data looks normal, use the normtol.int function

    normtol.int (cleaned_df$Cq, alpha=alpha_num, P= coverage_level, side=2)

    If the data does not look normal, use a nonparametric tolerance function

    nptol.int (cleaned_df$Cq, alpha=alpha_num, P= coverage_level, side=2)

    Since the data is not normal, we'll use nptol.int. The following loops through multiple confidence and coverage levels.

    confidence_levels <- c(0.99, 0.95, 0.90)
    nonparametric_coverage_levels <- c(0.90, 0.95, 0.99)

    current_nonparametric <- NA  # initialize so the first iteration takes the if-branch

    for (confidence_level in confidence_levels) {
      alpha_num = 1 - confidence_level
      for (coverage_level in nonparametric_coverage_levels) {
        nonparametric <- ddply (all_controls_no_outliers, c('Fluor'), .fun = nptol_function,
          alpha=alpha_num, coverage_level=coverage_level)

        if (all (is.na (current_nonparametric))) {
          current_nonparametric <- nonparametric
        } else {
          current_nonparametric <- merge (current_nonparametric, nonparametric, all=TRUE)
        }
      } # end coverage_level
    } # end confidence_level

This code generates the results table at the top of this post.
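outlierKD comes from a third-party script; the Tukey rule it applies (drop points outside the quartiles +/- 1.5 IQR) can be sketched in base R without it:

```r
# Minimal sketch of the Tukey outlier rule: keep only values inside
# [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
remove_tukey_outliers <- function (x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  x[x >= q[1] - 1.5 * iqr & x <= q[2] + 1.5 * iqr]
}

cq <- c(30.1, 30.4, 30.2, 30.3, 45.0)   # 45.0 is an obvious outlier
remove_tukey_outliers(cq)               # drops 45.0, keeps the rest
```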

Dec
04

FDA Validation of a PCR Test: Limit of Detection (LoD) (Part 7)

Limit of detection is the sensitivity of the assay — how low a concentration can the test detect?

To test this, the lab did a dilution series over a range of concentrations. When the concentration is low enough, the fusion won’t be detected.

Using the data from the dilution series, we used 4 models to predict the limit of detection.

  1. Linear regression using all data
  2. Linear regression using just the concentrations where data was observed
  3. Logit
  4. Probit

1. Linear regression using all data

My data frame “chan” contains deltaCq values and the Percentage_Fusion.

First do the linear regression:


chan.lm <- lm (deltaCq ~ log2 (Final_Percentage_Fusion), data=chan)
summary (chan.lm)

with a prediction interval:

PI <- predict (chan.lm, newdata = new.dat, interval='prediction', level=0.90)

The fusion dilution where the upper prediction limit crosses the delta Ct cutoff is taken as the model's estimate of the C95.

My delta Ct cutoff is 5 (determined from the Accuracy study).

cutoff <- 5

Our linear model on the upper prediction interval is:


upper.lm <- lm (PI[,"upr"] ~ log2(conc) )

y = slope * x + intercept, and hence x = (y - intercept)/slope. In our case, x is the concentration, which is what we want; y is delta Ct, and we want to know what concentration is predicted to have a delta Ct of 5 (our cutoff).

slope <- upper.lm$coefficients[2]
intercept <- upper.lm$coefficients[1]


C95 = (cutoff - intercept)/slope


conc_LoD = 2^C95

conc_LoD is our limit of detection.

2. Linear regression using just the concentrations where data was observed

To do a linear regression on just some of the data, identify which concentrations to remove.

remove_because_partial_or_no_data <- c(0.0488, 0.0977, 0.1953, 0.3906, 0.781, 1.563)

Remove these concentrations from the data frame:

chan <- chan[!(chan$Final_Percentage_Fusion %in% remove_because_partial_or_no_data),]

and then follow the same steps as in #1.

3. Logit model

First, make a column of whether the PCR product is called as a positive or not. This column is called "CalledPositive".


chan["CalledPositive"] <- 1

If no signal is detected or deltaCq is > 5, then the fusion is not called and we have to assign CalledPositive as false or "0".

chan[is.na(chan$deltaCq), "CalledPositive"] <- 0


chan[!is.na(chan$deltaCq) & chan$deltaCq > 5, "CalledPositive"] <- 0

Run the logit analysis

mylogit <- glm (CalledPositive ~ log2Conc , family = binomial(link = "logit"), data = chan)

To find the sensitivity at 95% based on the logit model, I manually feed in ranges of values until I get something close to 95%.

For example, by inspecting the graph I know that my LoD is somewhere between 0 and 2 so I set


xlower <- 0


xupper <- 2

and then I make a series of points:

temp.data <- data.frame(log2Conc = seq(from = xlower, to = xupper, length.out = 10))

And I input temp.data to get predicted sensitivities:


predicted.data.logit <- predict(mylogit, temp.data, type = "response", se.fit=TRUE)

I then look at predicted.data.logit to see which concentration gives me close to 95%. This can take several iterations until I'm very close to 0.95.
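Instead of iterating by hand, R's uniroot can solve for the 95% crossing directly. A sketch on simulated stand-in data (toy_chan and toy_logit are made up for the example; with the real data you would reuse mylogit from above):

```r
# Simulated stand-in for the dilution-series data (not the real assay data)
set.seed(1)
toy_chan <- data.frame(log2Conc = rep(seq(-2, 4, by = 0.5), each = 20))
p_true   <- plogis(3 * toy_chan$log2Conc)          # true 50% point at log2Conc = 0
toy_chan$CalledPositive <- rbinom(nrow(toy_chan), 1, p_true)

toy_logit <- glm(CalledPositive ~ log2Conc, family = binomial(link = "logit"),
                 data = toy_chan)

# Solve predicted sensitivity == 0.95 numerically; uniroot needs the
# sensitivity to cross 95% somewhere between the bounds
sens_gap <- function (x) {
  predict(toy_logit, data.frame(log2Conc = x), type = "response") - 0.95
}
root     <- uniroot(sens_gap, lower = -2, upper = 4)
lod_conc <- 2 ^ root$root   # back from log2 to % concentration
```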

4. Probit model

We make use of the "CalledPositive" column defined from the section 3 logit model.

myprobit <- glm(CalledPositive ~ log2Conc , family = binomial(link = "probit"), data = chan)

To find the concentration with a sensitivity of 95%, do the same steps as described in section 3 (the logit model), but use predicted.data.probit instead.


predicted.data.probit <- predict(myprobit, temp.data, type = "response", se.fit=TRUE)

You can find my complete linear regression R scripts and probit and logit scripts on GitHub.

Nov
27

FDA Validation of a PCR test: Accuracy (Part 4)

In the Accuracy study, we will:

  • Set the acceptable working range
  • Set the cutoffs for calling a sample positive or negative
  • Assess how well our assay agrees with an alternative assay

Clinical samples can be FFPE or fresh frozen. FFPE is usually of poorer quality than fresh frozen, so the two sample types are treated separately.

Section 1. Setting the Acceptable Working Range

The “acceptable working range” is based on the observed values from clinical samples. Once the acceptable working range is set, samples that have values outside the working range will be rejected.

To calculate the acceptable working range, we look at the observed values of Texas Red for the wild-type samples and calculate prediction intervals. In R, alpha = (1 – confidence_interval) and P is coverage.

To calculate prediction intervals, we can use either normtol.int or nptol.int:

normtol.int (df$Cq, alpha=alpha_num, P= coverage_level, side=2) # for normally distributed data

nptol.int (df$Cq, alpha=alpha_num, P= coverage_level, side=2) #for data that's not normally distributed

The Shapiro test for normality:

shapiro.test (df$Cq)

showed that the Ct values were non-normal, so the non-parametric nptol.int was used to obtain the table below:

Sample | Alpha | Confidence | Coverage | 2-sided Lower | 2-sided Upper
FFPE   | 0.01  | 0.99       | 0.95     | 23.32         | 35.13
FFPE   | 0.01  | 0.99       | 0.99     | 23.27         | 35.51
FFPE   | 0.05  | 0.95       | 0.9      | 24.99         | 32.36
FFPE   | 0.05  | 0.95       | 0.95     | 24.55         | 34.84
FFPE   | 0.05  | 0.95       | 0.99     | 23.27         | 35.51
FFPE   | 0.1   | 0.9        | 0.9      | 25.04         | 32.18
FFPE   | 0.1   | 0.9        | 0.95     | 24.55         | 34.84
FFPE   | 0.1   | 0.9        | 0.99     | 23.27         | 35.51

Based on the table above, we decide on the acceptable working range. We usually go for the broadest range of values because we don’t want to reject too many samples.

Section 2. Set the cutoffs for calling a sample positive or negative

A sample will be called positive or negative based on its ΔCt value. Based on the plot of the ΔCt values with both positive and negative samples, we pick a cutoff that separates the negative samples from the positive samples.

Based on the figure above, we chose a ΔCt cutoff of 5.  Samples with ΔCt <= 5 will be mutation-positive and samples with ΔCt > 5 will be mutation-negative.

 

Section 3. Agreement with an Alternative Assay

This is fairly easy and doesn’t need complicated statistics.

                          Original Method
                      Positive   Negative   Total
New Method  Positive  a          b          a+b
            Negative  c          d          c+d
            Total     a+c        b+d        n
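From the 2x2 table, the usual agreement measures are simple ratios: positive percent agreement (PPA) = a/(a+c), negative percent agreement (NPA) = d/(b+d), and overall agreement = (a+d)/n. A sketch with made-up counts:

```r
# Agreement measures from the 2x2 table (counts a, b, c, d are made up)
a <- 45; b <- 2; c <- 3; d <- 50
n <- a + b + c + d

ppa     <- a / (a + c)        # positive percent agreement
npa     <- d / (b + d)        # negative percent agreement
overall <- (a + d) / n        # overall percent agreement
c(PPA = ppa, NPA = npa, Overall = overall)
```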

Nov
27

FDA Validation of a PCR test: Analytical Specificity (Part 3)

Analytical specificity shows how robust the test is to contamination. Positive controls with the fusion were spiked with EDTA or ethanol at various concentrations. We determined at which concentration the PCR no longer worked.

The figures and results in this blog post were generated by the R code in “AnalyticalSpecificity_R_scripts.txt” on GitHub.

Pictures always help, so these are some results:


EDTA affects FAM at higher concentrations.


EDTA affects Texas Red at higher concentrations.


EDTA doesn’t seem to affect ΔCt at higher concentrations.

From the pictures, it’s obvious EDTA affects FAM and Texas Red at higher concentrations. At what concentration does it start to significantly differ? (We’ll look at the p-values later in this post.)

Experimental Setup

Our PCR data contains in the Content column the groups “Unkn-01”, “Unkn-02”, …, “Unkn-10”, where “Unkn-01” indicates the first concentration and “Unkn-10” indicates the 10th concentration tested.

For ethanol, the 10 concentrations that were tested were 4%, 2%, 1%, 0.5%, 0.25%, 0.125%, 0.063%, 0.031%, 0.016%, 0%.

For EDTA, the 10 concentrations that were tested were 20 mM, 10 mM, 5 mM, 2.5 mM, 1.25 mM, 0.625 mM, 0.313 mM, 0.156 mM, 0.078 mM and 0 mM.

Preliminary steps: the data is loaded into R, followed by a whole bunch of cleaning and relabeling (check out the script on GitHub).

We summarize the statistics for the channels under the different conditions:

summary_all_data <- ddply (df, c('Fluor', 'Spikein_Level', 'Sample_no_number'), .fun=summary_function)

The results look like:

Fluor     | Spikein_Level | Sample_no_number       | Number of observations | Number of missing | Mean | SD  | CV
FAM       | 0             | Positive Control -EDTA | 8                      | 0                 | 29.7 | 0.3 | 0
FAM       | 0.078         | Positive Control -EDTA | 8                      | 0                 | 29.5 | 0.2 | 0
FAM       | 0.156         | Positive Control -EDTA | 8                      | 0                 | 29.6 | 0.3 | 0
FAM       | 0.313         | Positive Control -EDTA | 8                      | 0                 | 29.7 | 0.3 | 0
FAM       | 0.625         | Positive Control -EDTA | 8                      | 0                 | 29.8 | 0.2 | 0
FAM       | 1.25          | Positive Control -EDTA | 8                      | 0                 | 29.9 | 0.3 | 0
FAM       | 2.5           | Positive Control -EDTA | 8                      | 0                 | 30   | 0.2 | 0
FAM       | 5             | Positive Control -EDTA | 8                      | 0                 | 30.1 | 0.2 | 0
FAM       | 10            | Positive Control -EDTA | 8                      | 0                 | 30.3 | 0.1 | 0
FAM       | 20            | Positive Control -EDTA | 8                      | 0                 | 32.5 | 0.9 | 0
HEX       | 0             | Positive Control -EDTA | 8                      | 0                 | 32.4 | 0.5 | 0
HEX       | 0.078         | Positive Control -EDTA | 8                      | 0                 | 32.1 | 0.5 | 0
HEX       | 0.156         | Positive Control -EDTA | 8                      | 0                 | 32.2 | 0.5 | 0
HEX       | 0.313         | Positive Control -EDTA | 8                      | 0                 | 32.2 | 0.4 | 0
HEX       | 0.625         | Positive Control -EDTA | 8                      | 0                 | 32.4 | 0.5 | 0
HEX       | 1.25          | Positive Control -EDTA | 8                      | 0                 | 32.4 | 0.3 | 0
HEX       | 2.5           | Positive Control -EDTA | 8                      | 0                 | 32.7 | 0.5 | 0
HEX       | 5             | Positive Control -EDTA | 8                      | 0                 | 32.3 | 0.5 | 0
HEX       | 10            | Positive Control -EDTA | 8                      | 0                 | 33.1 | 0.5 | 0
HEX       | 20            | Positive Control -EDTA | 8                      | 0                 | 34.6 | 2.1 | 0.1
Texas Red | 0             | Positive Control -EDTA | 8                      | 0                 | 30.5 | 0.2 | 0
Texas Red | 0.078         | Positive Control -EDTA | 8                      | 0                 | 30.7 | 0.3 | 0
Texas Red | 0.156         | Positive Control -EDTA | 8                      | 0                 | 30.6 | 0.2 | 0
Texas Red | 0.313         | Positive Control -EDTA | 8                      | 0                 | 30.8 | 0.3 | 0
Texas Red | 0.625         | Positive Control -EDTA | 8                      | 0                 | 30.8 | 0.3 | 0
Texas Red | 1.25          | Positive Control -EDTA | 8                      | 0                 | 30.7 | 0.3 | 0
Texas Red | 2.5           | Positive Control -EDTA | 8                      | 0                 | 30.7 | 0.3 | 0
Texas Red | 5             | Positive Control -EDTA | 8                      | 0                 | 30.9 | 0.3 | 0
Texas Red | 10            | Positive Control -EDTA | 8                      | 0                 | 31.1 | 0.3 | 0
Texas Red | 20            | Positive Control -EDTA | 8                      | 0                 | 33.6 | 1.1 | 0

We compare each concentration with the negative control (concentration = 0):

df_control = cleaned_df[cleaned_df$Content == "Unkn-10",] # "Unkn-10" is the 0-concentration control
if (nrow (df_control) > 0) {
  wilcox_results <- ddply (cleaned_df, c('Fluor', 'Spikein_Level', 'Sample_no_number'),
    .fun = mann_whitney_function, controls=df_control)

  mann_whitney_results_filename = paste (folder, "Mann-Whitney_", chemical, ".csv")
}

The Mann Whitney results look something like:

Fluor     | Spikein_Level | Sample_no_number | W  | p-value
FAM       | 0             | PC-EDTA          | 32 | 1
FAM       | 0.078         | PC-EDTA          | 22 | 0.328205128
FAM       | 0.156         | PC-EDTA          | 32 | 1
FAM       | 0.313         | PC-EDTA          | 32 | 1
FAM       | 0.625         | PC-EDTA          | 42 | 0.328205128
FAM       | 1.25          | PC-EDTA          | 45 | 0.194871795
FAM       | 2.5           | PC-EDTA          | 54 | 0.020668221
FAM       | 5             | PC-EDTA          | 56 | 0.01041181
FAM       | 10            | PC-EDTA          | 64 | 0.0001554
FAM       | 20            | PC-EDTA          | 64 | 0.0001554
HEX       | 0             | PC-EDTA          | 32 | 1
HEX       | 0.078         | PC-EDTA          | 22 | 0.328205128
HEX       | 0.156         | PC-EDTA          | 27 | 0.645376845
HEX       | 0.313         | PC-EDTA          | 29 | 0.798445998
HEX       | 0.625         | PC-EDTA          | 36 | 0.720901321
HEX       | 1.25          | PC-EDTA          | 39 | 0.505361305
HEX       | 2.5           | PC-EDTA          | 47 | 0.13038073
HEX       | 5             | PC-EDTA          | 26 | 0.573737374
HEX       | 10            | PC-EDTA          | 55 | 0.014763015
HEX       | 20            | PC-EDTA          | 61 | 0.001087801
Texas Red | 0             | PC-EDTA          | 32 | 1
Texas Red | 0.078         | PC-EDTA          | 46 | 0.160528361
Texas Red | 0.156         | PC-EDTA          | 40 | 0.441802642
Texas Red | 0.313         | PC-EDTA          | 47 | 0.13038073
Texas Red | 0.625         | PC-EDTA          | 50 | 0.064957265
Texas Red | 1.25          | PC-EDTA          | 48 | 0.104895105
Texas Red | 2.5           | PC-EDTA          | 45 | 0.194871795
Texas Red | 5             | PC-EDTA          | 57 | 0.006993007
Texas Red | 10            | PC-EDTA          | 61 | 0.001087801
Texas Red | 20            | PC-EDTA          | 64 | 0.0001554

For example, FAM signals are significantly different from the control starting at 2.5 mM (p=0.02).

We also look for consistent trends. At all concentrations at or above 2.5 mM, the FAM signals are significantly different from the control, which is what we expect.

Nov
27

FDA Validation of a PCR test: Pre-processing data (Part 2)

All of the data analyzed used the same format, to make it easy to re-use code.

The data for each PCR experiment was in an Excel file or a comma-delimited (.csv) file with the following format.

Well | Fluor     | Target           | Content | Sample | Biological Set Name | Cq          | Cq Mean    | Cq Std. Dev
A01  | FAM       | Fusion           | Unkn-1  | Fusion |                     | 28.186823   | 28.2589569 | 0.10657898
A01  | HEX       | Internal Control | Unkn-1  | Fusion |                     | 30.82512469 | 30.8171773 | 0.193826672
A01  | Texas Red | Gene A           | Unkn-1  | Fusion |                     | NaN         | 0          | 0

To complete the FDA sections above, data from the different runs need to be combined. So I manually added an additional column with a unique run name to every run, which lets me combine the runs and still distinguish between wells.

| Run | Well | Fluor | Target | Content | Sample | Biological Set Name | Cq | Cq Mean | Cq Std. Dev |
|---|---|---|---|---|---|---|---|---|---|
| VP-xxxx-yyy_001 | A01 | FAM | Fusion | Unkn-1 | | Fusion | 28.186823 | 28.2589569 | 0.10657898 |
| VP-xxxx-yyy_001 | A01 | HEX | Internal Control | Unkn-1 | | Fusion | 30.82512469 | 30.8171773 | 0.193826672 |
| VP-xxxx-yyy_001 | A01 | Texas Red | Gene A | Unkn-1 | | Fusion | NaN | 0 | 0 |
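As an alternative to editing each file by hand, the Run column can also be added while reading the per-run files. A sketch, assuming the run name is taken from the file name (the workflow and function name here are my own, not from the original scripts):

```r
# Sketch: read each per-run CSV, tag its rows with a Run column derived
# from the file name, and stack them into one combined data frame.
combine_runs <- function (run_files) {
  per_run <- lapply (run_files, function (f) {
    df <- read.csv (f, header = TRUE)
    df$Run <- sub ("\\.csv$", "", basename (f))  # run name from file name
    df
  })
  do.call (rbind, per_run)
}
```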

Sometimes we have to exclude a few wells due to operator or technical errors. In a separate file called “wells_to_exclude.csv”, I listed the Run and Well ID.

| Run | Well | Comments |
|---|---|---|
| VP-xxxx-yyy_001 | A08 | Operator error |
| VP-xxxx-yyy_002 | A12 | Forgot to add template |

I can remove these wells with the following code.

For all of my analyses, the first 10 lines of my R code are the same: the data from the PCR runs are read into a data frame and then bad wells are excluded.

Here’s the code:


# libraries that I use a lot
library (tolerance)
library (plyr)
library (EnvStats)

# folder to print all my output
folder = "C:\\Users\\pauline\\Documents\\Fusion\\AnalyticalSpecificity\\"

# file containing all of the runs combined for this particular experiment.
# See format in Table 2 above.
filename = "C:\\Users\\pauline\\Documents\\Fusion\\EDTA_EtOH_combined.csv"

# read in data
whole_df <- read.csv (filename, header=TRUE)

# make a unique id combining the Run ID and Well ID
whole_df[,"UniqueID"] = paste (whole_df[,"Run"], whole_df[,"Well"])

# remove problematic wells listed in wells_to_exclude.csv
df_problem <- read.csv ("C:\\Users\\pauline\\Documents\\Fusion\\wells_to_exclude.csv", header=TRUE)
problem_samples <- paste (df_problem[,"Run"], df_problem[,"Well"])
whole_df <- whole_df[!(whole_df$UniqueID %in% problem_samples), ]

whole_df is a data frame containing all the data, excluding the problematic wells.

To call a fusion, we use ΔCt. For the same well, we have to calculate ΔCt = FAM Ct – Texas Red Ct.

In all my scripts, I’ll create a new data frame called “channels”, where ΔCt is calculated by merging the FAM and Texas Red rows into the same row (based on run and well), and then doing the subtraction.


# get Texas Red values & other important stuff
df_TexasRed <- df[df$Fluor == "Texas Red", c("UniqueID", "Fluor", "Target", "Sample", "Spikein_Level", "Sample_no_number", "Content", "Cq")]


# get FAM values only
df_FAM <- df[df$Fluor == "FAM", c("UniqueID", "Fluor", "Cq")]


# merge into same row, based on UniqueID (run & well)
channels <- merge (df_TexasRed ,df_FAM, by="UniqueID")


# calculate deltaCq
channels[,"deltaCq"] <- channels["Cq.y"] - channels["Cq.x"]

If I'm lazy and want to reuse the same functions as the FAM and Texas Red channels, which expect a "Ct" column, I might also create channels["Ct"] and assign the ΔCt values to that column.
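On toy data, this merge-and-subtract step looks like the following (a sketch; the `.x`/`.y` names come from R's merge suffixes, where Cq.x is Texas Red and Cq.y is FAM):

```r
# Toy merged data frame standing in for "channels"
channels <- data.frame (UniqueID = c("run1 A01", "run1 A02"),
                        Cq.x = c(30.2, 31.0),   # Texas Red Cq
                        Cq.y = c(28.1, 28.6))   # FAM Cq

# deltaCq = FAM Cq - Texas Red Cq
channels[,"deltaCq"] <- channels[,"Cq.y"] - channels[,"Cq.x"]

# copy into a "Ct" column so functions written against "Ct" also work
channels[,"Ct"] <- channels[,"deltaCq"]
```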

Mar
15

Choosing the Right College

Is it worth paying more for a “top 10” school?

You get accepted to a private school / Ivy League that will cost a fortune. You’re also accepted by a public university which doesn’t have as good a reputation but costs much less. Does it matter which one you choose?


One factor in your decision is whether the school will increase your chance of being successful. A school could be ‘worth it’ if it produces successful people, which we define as its alumni appearing in Wikipedia.


To answer this question, we counted the number of ‘successful’ graduates of each college (defined as the alumni who appear in Wikipedia), then calculated the likelihood of appearing in Wikipedia for an alumnus of a given college.

Method details:
- The likelihood of a college alumnus appearing in Wikipedia is calculated as a relative ratio.
- If the relative ratio is 1, the number of alumni observed in Wikipedia matches what’s expected based on the college’s size.
- If the relative ratio is greater than 1, the number of people in Wikipedia is higher than what’s expected based on the college’s size, meaning this school increases your chance of success.
Equations here
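The relative ratio can be sketched as an observed share over an expected share (my reading of the method described above; the function and variable names are illustrative):

```r
# Relative ratio = (college's share of Wikipedia-listed alumni) /
#                  (college's share of total enrollment)
relative_ratio <- function (wiki_alumni, enrollment,
                            total_wiki_alumni, total_enrollment) {
  observed <- wiki_alumni / total_wiki_alumni   # share of Wikipedia people
  expected <- enrollment / total_enrollment     # share of all students
  observed / expected
}
```

A ratio of 1 means the school produces Wikipedia-listed alumni exactly in proportion to its size.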

The table below shows the colleges with the most Wikipedia enrichment.

For example, Harvard has a relative ratio of 50, which means that alumni are 50x more likely to appear in Wikipedia than expected.





The relative ratio of appearing in Wikipedia is plotted against the U.S. News & World Report rankings. We see that people who attend the top-ranked schools do have a higher likelihood of appearing in Wikipedia.


So this supports going to a top-ranked school. Also notice what happens after rank 40: the college doesn’t seem to matter for getting into Wikipedia.

Full analysis with gory details

Also, check out my earlier post, which shows that you don’t even need to go to college for certain professions.
