
Sunday, March 31, 2019

Phi X 174

What is it?

    1. A single-stranded DNA (ssDNA) virus that infects Escherichia coli
    2. The first DNA-based genome to be sequenced (1977)
    3. Well-defined, small (5,386 bp), and diverse (45% GC, 55% AT) genome
    4. FASTA file download links:
      1. PhiX_from_Illumina
      2. PhiX_from_NCBI
    5. Used as a positive control in Illumina NGS
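
The genome size and base composition listed above can be checked directly on a downloaded FASTA file. The following is a minimal sketch, assuming a POSIX awk is available; the helper name `gc_content` and the file name `phix.fa` are hypothetical and not part of the original post.

```shell
# gc_content FASTA: print the total sequence length and the GC percentage.
# Hypothetical helper for illustration; expects a plain (uncompressed) FASTA file.
gc_content() {
    awk '
        /^>/ { next }                      # skip FASTA header lines
        {
            seq = toupper($0)              # treat soft-masked (lowercase) bases the same
            total += length(seq)
            gc += gsub(/[GC]/, "", seq)    # gsub returns the number of G/C characters replaced
        }
        END {
            if (total > 0) printf "length=%d GC=%.1f%%\n", total, 100 * gc / total
        }
    ' "$1"
}

# Usage (phix.fa is an assumed local file name):
#   gc_content phix.fa
```

Ambiguity codes such as N are counted in the length but not as G or C, so a heavily masked file would report a lower GC percentage than the true value.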



    What are the benefits of using a PhiX control?

    1. Calibration control: PhiX can be run alone and serves as a calibration control for:
      1. Cluster generation: it can be used as a positive control in the clustering process

        Platform  Mode/Reagents                              Optimal Raw Cluster Density
        HiSeq     High Output, TruSeq v3                     750-850 K/mm²
        HiSeq     High Output, HiSeq v4 (required upgrade)   950-1,050 K/mm²
        HiSeq     Rapid v2                                   850-1,000 K/mm²
        MiSeq     v2                                         1,000-1,200 K/mm²
        MiSeq     v3                                         1,200-1,400 K/mm²
        MiniSeq   Mid and High Output                        170-220 K/mm²
        NextSeq   Mid and High Output, v2                    170-220 K/mm²
        [table 1] Cluster density guidelines for Illumina sequencing platforms

      2. Cross-talk matrix generation
        1. During an Illumina sequencing run, the cross-talk caused by spectral overlap between the 4 fluorescently labeled nucleotides is calculated during template generation in cycles 1-5
        2. https://www.slideshare.net/idtdna/unique-dualmatched-adapters-mitigate-index-hopping-between-ngs-samples
      3. Phasing and Prephasing
        1. During sequencing by synthesis, each DNA strand in a cluster extends by 1 base per cycle
        2. A small proportion of strands may fall out of phase with the current cycle, either falling a base behind (phasing) or jumping a base ahead (prephasing)
        3. For best results, use a PhiX spike-in as a control with any library that does not have a balanced base composition
        4. High-GC samples (≥ 60% GC) typically show higher phasing rates, and in this case a PhiX control is required

    2. Run quality monitor: due to its small size and balanced nucleotide composition, it's an ideal in-run control (typically with >= 1% spike-in) for run quality monitoring

      Platform                                 PhiX Aligned (%)
      iSeq 100                                 minimum 5%
      MiniSeq                                  10~50%
      MiSeq (MCS 2.2 or higher)                minimum 5%
      NextSeq                                  10~50%
      HiSeq 2500 (HCS 2.2.38 or higher)        minimum 10%
      HiSeq 3000/4000 (HCS 3.3.76 or lower)    10~50%
      HiSeq 3000/4000 (HCS 3.4.0 or higher)    5~20%
      NovaSeq                                  minimum 10%
      [table 2] PhiX Control v3 spike-in percentages Illumina recommends when running low-diversity libraries

    3. Color balancing
      1. For low-diversity libraries, the PhiX Control v3 library provides balanced fluorescent signals at each cycle, which improves the overall run quality
      2. For why nucleotide diversity matters, see the Nucleotide Diversity post

    How to remove PhiX reads from the fastq


      Nucleotide Diversity

      What is nucleotide diversity?

      1. High nucleotide diversity: the library contains all four nucleotides in roughly equal proportions at every sequencing cycle.
      2. The figure below shows the diversity and base balance of a well-balanced and an unbalanced library, and how each state appears in the % base plot of Sequencing Analysis Viewer (SAV).
      [fig 1] Illustration of diversity and base balance

      Why is nucleotide diversity important?

      1. Nucleotide diversity is required for effective template generation and is important for producing high-quality data.
      2. On the MiniSeq, MiSeq, NextSeq, and HiSeq 1000-2500 systems, diversity matters during the first 4~7 cycles of the first sequencing read. In a process called template generation, the sequencing software uses the images from these early cycles to locate each cluster.
      3. Diversity also matters during the first 25 cycles, because phasing/pre-phasing, color matrix corrections, and pass filter calculations take place in them.
      4. The Real-Time Analysis (RTA) software needs an appropriate amount of PhiX spike-in. Related details can be found here.
        ref)
          https://support.illumina.com/bulletins/2016/07/what-is-nucleotide-diversity-and-why-is-it-important.html

          Friday, March 29, 2019

          [K-MOOC] Data Analytics for Forecasting and Classification: Syllabus


          • Course: Data Analytics for Forecasting and Classification
          • Professor: Chi-hyuk Jeon / POSTECH
          • Goals
            • Understanding statistics-based data analysis methods for forecasting and classification
            • Building data analysis skills and the ability to apply data analytics methods
          • Prerequisites
            • Probability and Statistics, Linear Algebra, Optimization
          • Schedule

          [K-MOOC] Data Analytics for Forecasting and Classification: 1-1. Regression analysis, Simple regression model, Model estimation

          Regression Analysis

          1. Analyzes the statistical cause-and-effect relationship between related variables in order to explain a variable of interest
          2. independent variable: the cause
          3. dependent variable: the outcome

          Regression Model

          1. Simple Regression Model
            1. 𝑿 ⇨ 𝒀
            2. Observations: (𝑿₁,𝒀₁), (𝑿₂,𝒀₂), ... , (𝑿𝑛,𝒀𝑛) (𝑛 is the number of observations)
            3. Simple Regression Model: 
              1. 𝒀𝑖 = 𝜷₀ + 𝜷₁𝑿𝑖 + 𝜺𝑖,    𝑖 = 1,2, ... , 𝑛
                1. 𝜺𝑖: error term. 
                  1. Assumed to follow a normal distribution with mean 0 and variance 𝛔²
                  2. 𝜺𝑖 ~ 𝙉(0,𝛔²)
                2. 𝑿 is not a random variable, but a given value
                3. so, three parameters need to be estimated
                  1. 𝜷₁: slope of the linear equation
                  2. 𝜷₀: intercept
                  3. 𝛔²: variance of the error term
            4. Estimation of intercept 𝜷₀ and slope 𝜷₁
              1. Using least squares method
              2. to minimize the objective function 𝐐
              3. objective function 𝐐
                1. the sum of squared differences between each observed value 𝒀𝑖 of the dependent variable and the corresponding fitted value 𝜷₀ + 𝜷₁𝑿𝑖 on the regression line
                2. 𝐐 = ∑(𝒀𝑖 - 𝜷₀ - 𝜷₁𝑿𝑖)²
            5. How to?
              1. (𝑿,𝒀) are observed values, so treat 𝐐 as a function of 𝜷₀ and 𝜷₁ 
              2. partially differentiate 𝐐 with respect to 𝜷₀ and set the result to zero
                ∂𝐐/∂𝜷₀ = -2∑(𝒀𝑖 - 𝜷₀ - 𝜷₁𝑿𝑖) = 0
              3. partially differentiate 𝐐 with respect to 𝜷₁ and set the result to zero
                ∂𝐐/∂𝜷₁ = -2∑(𝒀𝑖 - 𝜷₀ - 𝜷₁𝑿𝑖)𝑿𝑖 = 0
              4. solving these two equations gives 𝜷₀-hat and 𝜷₁-hat, and the estimated equation: 𝒀-hat = 𝜷₀-hat + 𝜷₁-hat * 𝑿
            6. Estimation of variance of the error term 𝛔²
              1. Using the sample variance of the residuals
                1. residual
                  subtract the estimated value from the observed value of 𝒀
                  𝒆𝑖 = 𝒀𝑖 - 𝒀𝑖-hat = 𝒀𝑖 - 𝜷₀-hat - 𝜷₁-hat * 𝑿𝑖
                2. SSE
                  residual/error sum of squares
                  = ∑(𝒀𝑖 - 𝒀𝑖-hat)²
                3. estimate 𝛔² by using MSE
                  𝛔²-hat = MSE(Mean Squared Error) = SSE / (𝑛-2)
                  (𝑛-2) is the degrees of freedom, since two are used up estimating 𝜷₀ and 𝜷₁
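
The two derivative conditions in the "How to?" step can be solved in closed form. The block below is a sketch of that standard solution followed by a small worked example; the three data points (1,2), (2,3), (3,5) are made up for illustration and do not come from the course.

```latex
% Setting both partial derivatives of Q to zero gives the normal equations:
\[
\begin{aligned}
\sum_{i=1}^{n} Y_i &= n\hat\beta_0 + \hat\beta_1 \sum_{i=1}^{n} X_i \\
\sum_{i=1}^{n} X_i Y_i &= \hat\beta_0 \sum_{i=1}^{n} X_i + \hat\beta_1 \sum_{i=1}^{n} X_i^2
\end{aligned}
\]
% Solving them for the two unknowns:
\[
\hat\beta_1 = \frac{\sum_{i=1}^{n} (X_i-\bar X)(Y_i-\bar Y)}{\sum_{i=1}^{n} (X_i-\bar X)^2},
\qquad
\hat\beta_0 = \bar Y - \hat\beta_1 \bar X
\]
% Worked example with the illustrative points (1,2), (2,3), (3,5):
\[
\begin{aligned}
\bar X &= 2, \quad \bar Y = \tfrac{10}{3}, \quad
\hat\beta_1 = \tfrac{3}{2}, \quad
\hat\beta_0 = \tfrac{10}{3} - \tfrac{3}{2}\cdot 2 = \tfrac{1}{3} \\
e_1 &= \tfrac{1}{6}, \quad e_2 = -\tfrac{1}{3}, \quad e_3 = \tfrac{1}{6}, \quad
\mathrm{SSE} = \tfrac{1}{36} + \tfrac{1}{9} + \tfrac{1}{36} = \tfrac{1}{6} \\
\hat\sigma^2 &= \mathrm{MSE} = \frac{\mathrm{SSE}}{n-2} = \frac{1/6}{3-2} = \tfrac{1}{6}
\end{aligned}
\]
```

Note that the residuals sum to zero, which is exactly what the first normal equation requires.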

            Thursday, March 28, 2019

            [A6000 + 30.4] Piazzale Michelangelo6


            2019. 03
            from Piazzale Michelangelo, Florence, Italy
            Sony A6000 + Sigma 30mm f1.4

            [A6000 + 30.4] Piazzale Michelangelo5


            2019. 03
            from Piazzale Michelangelo, Florence, Italy
            Sony A6000 + Sigma 30mm f1.4

            [A6000 + 30.4] Piazzale Michelangelo4


            2019. 03
            from Piazzale Michelangelo, Florence, Italy
            Sony A6000 + Sigma 30mm f1.4

            [A6000 + 30.4] Piazzale Michelangelo3


            2019. 03
            from Piazzale Michelangelo, Florence, Italy
            Sony A6000 + Sigma 30mm f1.4

            [A6000 + 30.4] Piazzale Michelangelo2


            2019. 03
            from Piazzale Michelangelo, Florence, Italy
            Sony A6000 + Sigma 30mm f1.4

            [A6000 + 30.4] Piazzale Michelangelo1


            2019. 03
            from Piazzale Michelangelo, Florence, Italy
            Sony A6000 + Sigma 30mm f1.4

            Wednesday, March 27, 2019

            [A6000 + 30.4] Battistero di San Giovanni


            2019. 02
            from Battistero di San Giovanni, Florence, Italy
            Sony A6000 + Sigma 30mm f1.4

            [A6000 + 30.4] Ponte Vecchio


            2019. 02
            from Ponte Vecchio, Florence, Italy
            Sony A6000 + Sigma 30mm f1.4

            [A6000 + 30.4] Ttukseom Hangang Park5


            2018. 06
            from Ttukseom Hangang Park
            Sony A6000 + Sigma 30mm f1.4

            [A6000 + 30.4] Ttukseom Hangang Park4


            2018. 06
            from Ttukseom Hangang Park
            Sony A6000 + Sigma 30mm f1.4

            [A6000 + 30.4] Ttukseom Hangang Park3


            2018. 06
            from Ttukseom Hangang Park
            Sony A6000 + Sigma 30mm f1.4

            [A6000 + 30.4] Ttukseom Hangang Park2


            2018. 06
            from Ttukseom Hangang Park
            Sony A6000 + Sigma 30mm f1.4

            [A6000 + 30.4] Ttukseom Hangang Park1


            2018. 06
            from Ttukseom Hangang Park
            Sony A6000 + Sigma 30mm f1.4

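            Why we divide by n-1 when estimating the standard deviation

            There are two reasons.

            1. To reduce the gap between the sample variance and the population variance
              ( intuitive reason )
              1. Dividing by n gives the maximum likelihood estimate of the population variance, but that estimate is mathematically biased.
              2. The sample variance is usually smaller than the population variance.
                When sampling from a very large population, most of the sample is drawn from around the center, so the sample variance tends to be smaller than the population variance. In addition, the deviations are measured from the sample mean, which is the point that minimizes the sum of squared deviations for that sample.
              3. Using 1/(n-1) (the unbiased estimate) reduces the gap between the two.
              4. Then why not n-2?
                1. This is related to the degrees of freedom: only one is used up, by estimating the mean.
            2. Because, when dividing by n-1, the expected value of the sample variance equals the population variance
              ( mathematical reason )
              1. Assume the following notation,
                n: sample size
                x̄: sample mean
                s²: sample variance
                μ: population mean
                σ²: population variance
              2. and show that the following equation holds.
                E[s²] = σ²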

              3. first,








              4. as here,
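
The equation images of the original derivation are not available here, so the block below is a sketch of the standard argument rather than the author's exact steps. It assumes the observations x_1, ..., x_n are independent draws with mean μ and variance σ², and uses s² = Σ(x_i - x̄)² / (n-1).

```latex
% Step 1: rewrite the deviations from the sample mean around the population mean.
\[
\sum_{i=1}^{n} (x_i - \bar x)^2
  = \sum_{i=1}^{n} \bigl[(x_i - \mu) - (\bar x - \mu)\bigr]^2
  = \sum_{i=1}^{n} (x_i - \mu)^2 - n(\bar x - \mu)^2
\]
% Step 2: take expectations, using E[(x_i - \mu)^2] = \sigma^2
% and E[(\bar x - \mu)^2] = \mathrm{Var}(\bar x) = \sigma^2 / n.
\[
E\!\left[\sum_{i=1}^{n} (x_i - \bar x)^2\right]
  = n\sigma^2 - n\cdot\frac{\sigma^2}{n}
  = (n-1)\,\sigma^2
\]
% Step 3: divide by n-1 to get an unbiased estimator; dividing by n is biased low.
\[
E[s^2] = E\!\left[\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar x)^2\right] = \sigma^2,
\qquad
E\!\left[\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar x)^2\right] = \frac{n-1}{n}\,\sigma^2
\]
```

The factor (n-1)/n in the last line is the size of the bias that the intuitive argument describes.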



            Monday, March 25, 2019

            xargs

            Using the output of a preceding command as the arguments of the next command

            xargs reads a data stream from standard input, then builds and runs the next command from it.

            1. Print every file with the .vcf extension under a given path (here, the first six directory levels of the current path)
              pwd | cut -d'/' -f1-7 | xargs -I{} find {} -name '*.vcf'
            2. Count the lines / words / characters of each file
              ls * | xargs -IFILE wc FILE
            3. Count all variants (non-header lines) in each vcf file
              ls *.vcf | xargs -IFILE grep -Hvc '^#' FILE