Hubert Life (out)

Thursday, April 18, 2019

[K-MOOC] Instruction to Deep Learning: 1-2. Methodology of Machine Learning

Machine Learning is so Extensive and Sophisticated

Machine Learning Tasks

Classification

assigns each data point to a specific category
the categories are defined in advance

Regression

Linear regression

finds the linear function that best explains the relationship between the independent variable x and the dependent variable y in a given data set {(x, y)}
simple linear regression:
y-hat = f(x) = 𝜷₀ + 𝜷₁x

Logistic regression

differs from linear regression in that the dependent variable (y) is nominal (categorical)
is a probabilistic model: it estimates the probability that y belongs to each category
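
As a rough illustration of the probabilistic view, the sketch below passes a linear score through the logistic (sigmoid) function to get a probability, then thresholds it into a nominal label. The coefficients `beta0` and `beta1` and the 0.5 threshold are made-up values for illustration; they are not fitted to any data and do not come from the lecture.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, beta0, beta1):
    """Probability that y = 1 given x, under a one-variable logistic model."""
    return sigmoid(beta0 + beta1 * x)

def predict_class(x, beta0, beta1, threshold=0.5):
    """Turn the probability into a nominal label (0 or 1)."""
    return 1 if predict_proba(x, beta0, beta1) >= threshold else 0

# Hypothetical coefficients, chosen only for illustration.
beta0, beta1 = -4.0, 2.0
for x in (0.0, 2.0, 4.0):
    print(x, round(predict_proba(x, beta0, beta1), 3), predict_class(x, beta0, beta1))
```

The output of the model is a probability, which is what separates it from linear regression, where the output is the predicted value itself.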

Clustering

is almost the same as classification; the only difference is that it has no predefined categories
forms clusters by learning the characteristics of the data itself (no labeled training data is needed)
application cases

grouping documents: by word frequency
grouping satellite images: by color tone
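
The notes do not name a specific clustering algorithm. As one common example, the sketch below uses k-means on one-dimensional data: it alternates between assigning each value to its nearest center and moving each center to the mean of its cluster. The "color tone" values and the initial centers are made up for illustration.

```python
def kmeans_1d(values, centers, iterations=10):
    """Very small k-means for 1-D data; `centers` are the initial guesses."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda k: abs(v - centers[k]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [sum(c) / len(c) if c else centers[k]
                   for k, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical "color tone" values of image pixels; no labels are given.
tones = [10, 12, 11, 200, 205, 198, 100, 103]
centers, clusters = kmeans_1d(tones, centers=[0, 128, 255])
print(centers)
print(clusters)
```

No category names are supplied anywhere: the three groups emerge only from how close the values are to each other.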

Machine Learning Modes

Supervised learning

teacher + students: is trained with labeled data (with answers)
classification, regression

Unsupervised learning

is trained with unlabeled data (without answers)
clustering

Reinforcement learning

was developed in the 1990s and is in the spotlight these days
is a learning method that can make a computer play better than human beings
learns to map each state to the specific action that brings the best reward
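
The state-to-action mapping can be sketched as a lookup over a table of estimated rewards. The states, actions, and reward values below are hypothetical, and the sketch only shows how a policy is read off such a table; it does not show how the rewards themselves are learned.

```python
# Hypothetical table of estimated rewards: q_values[state][action].
q_values = {
    "ball_left":  {"move_left": 1.0, "stay": 0.1, "move_right": -1.0},
    "ball_right": {"move_left": -1.0, "stay": 0.1, "move_right": 1.0},
}

def greedy_policy(q_values):
    """Map every state to the action with the highest estimated reward."""
    return {state: max(actions, key=actions.get)
            for state, actions in q_values.items()}

policy = greedy_policy(q_values)
print(policy)
```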

Machine Learning Technique

[K-MOOC] Instruction to Deep Learning: Syllabus

Course: Instruction to Deep Learning
Professor: Hee-chul Kim / Daegu University
Goals

Understanding the principles and algorithms of deep learning

Prerequisite

Mathematics (matrices, vectors, differentiation), Computer

Schedule

Week 1: Understanding machine learning

1-1. Outline of Machine Learning
1-2. Methodology of Machine Learning

[K-MOOC] Instruction to Deep Learning: 1-1. Outline of machine learning

Definition of Artificial Intelligence

the state in which a machine has intelligence (Nils Nilsson, 2010, 'The Quest for Artificial Intelligence')
but the problem is that 'intelligence' itself is ambiguous

Practical Definition of AI

technology that lets a machine carry out a task in a smart way
the whole range of what AI researchers study (Stanford AI 100 years Report)
the combination of all elements of the human cognitive process

History of AI

1956:

the first use of the term 'Artificial Intelligence'
the study was 'to proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it' (at the Dartmouth College workshop)

1970s, 1990s: AI winters. Despite huge investments and financial support, there were no meaningful results, and the support was cut off
2010s: interest in and expectations for AI are growing

Relationship of AI, ML, DL

Definition of Machine Learning

a completely different approach from conventional programming

conventional programming: data → program → output
machine learning: data & output → algorithm → program

a subfield of AI in which a computer behaves intelligently after learning from experience
in mathematical terms,

y = h(x)
y: output
h: function
x: feature
machine learning finds the function h( ) that is closest to the target function, using a set of samples S = {(x, y)}
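
A toy version of this search can be written as picking, from a handful of candidate functions, the one with the smallest error on the sample set. The samples, the three candidates, and the use of squared error as the measure of "closest" are all assumptions made for illustration.

```python
# Sample set S = {(x, y)}; the values are made up for illustration.
S = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)]

# A small, hypothetical hypothesis space: candidate functions h(x).
candidates = {
    "h(x) = x":     lambda x: x,
    "h(x) = 2x":    lambda x: 2 * x,
    "h(x) = x + 3": lambda x: x + 3,
}

def squared_error(h, samples):
    """Total squared difference between h(x) and the observed y."""
    return sum((h(x) - y) ** 2 for x, y in samples)

# Choose the candidate that fits the samples best.
best_name = min(candidates, key=lambda name: squared_error(candidates[name], S))
print(best_name)
```

Real learning algorithms search a much larger (usually continuous) space of functions, but the idea of scoring each h against S is the same.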

What is the Deep Learning?

a (large-scale) multilayered neural network structure
is trained through hierarchical abstraction: each layer learns a more abstract representation than the one before it

Benefits of DL

End-to-end learning: just provide the input data and get the output

Sunday, March 31, 2019

Phi X 174

What is it?

a single-stranded DNA (ssDNA) virus that infects Escherichia coli
the first DNA-based genome to be sequenced (in 1977)
Well-defined, small (5,386 bp), and diverse (45% GC, 55% AT) genome
fasta file download link:

Using it as a positive control in Illumina NGS

What are benefits of using PhiX control?

Calibration control: PhiX can be run alone and serves as a calibration control for:

Cluster generation: can be used as a positive control in the clustering process

Platform	Mode/Reagents	Optimal Raw Cluster Density
HiSeq	High Output, TruSeq v3	750-850 K/mm²
HiSeq	High Output, HiSeq v4 (required upgrade)	950-1050 K/mm²
HiSeq	Rapid v2	850-1,000 K/mm²
MiSeq	v2	1,000-1,200 K/mm²
MiSeq	v3	1,200-1,400 K/mm²
MiniSeq	Mid and High Output	170-220 K/mm²
NextSeq	Mid and High Output, v2	170-220 K/mm²

[table 1] Cluster density guidelines for Illumina sequencing platforms

Cross talk matrix generation

During an Illumina sequencing run, the cross-talk caused by spectral overlap between the four fluorescently labeled nucleotides is calculated during template generation in cycles 1-5
https://www.slideshare.net/idtdna/unique-dualmatched-adapters-mitigate-index-hopping-between-ngs-samples

Phasing and Prephasing

During sequencing by synthesis, each DNA strand in a cluster extends by 1 base per cycle
A small proportion of strands may become out of phase with the current cycle, either falling a base behind (phasing) or jumping a base ahead (prephasing)
For best results, use a PhiX spike-in as a control with any library that does not have a balanced base composition
High-GC samples (≥ 60%) typically show higher phasing rates, and in this case a PhiX control is required

Run quality monitoring: because of its small size and balanced nucleotide composition, PhiX is an ideal in-run control (typically with a ≥ 1% spike-in) for monitoring run quality

Platform	PhiX Aligned(%)
iSeq 100	minimum 5%
MiniSeq	10~50%
MiSeq (MCS 2.2 or higher)	minimum 5%
NextSeq	10~50%
HiSeq 2500 (HCS 2.2.38 or higher)	minimum 10%
HiSeq 3000/4000 (HCS 3.3.76 or lower)	10~50%
HiSeq 3000/4000 (HCS 3.4.0 or higher)	5~20%
NovaSeq	minimum 10%

[table 2] Amount of PhiX Control v3 library that Illumina recommends spiking in when running low-diversity libraries

Color balancing

For low diversity libraries, the PhiX Control v3 library provides balanced fluorescent signals at each cycle to improve the overall run quality
You can read about why nucleotide diversity is important in the Nucleotide Diversity post

How to remove PhiX reads from the fastq

Nucleotide Diversity

What is nucleotide diversity?

High nucleotide diversity: the state in which a library contains all four nucleotides in roughly equal proportions at every cycle of the sequencing run.
The figure below shows the diversity and base balance of a well-balanced library and of an unbalanced one, and how each is reflected in the % base plot of the Sequencing Analysis Viewer (SAV).

[fig 1] Illustration of diversity and base balance
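
The per-cycle base balance described above can be approximated directly from read sequences. The sketch below is not how SAV computes its % base plot; it simply counts A, C, G, and T at each read position. The two tiny read sets are made up: the first is balanced, the second starts with the same bases in every read, as a low-diversity library might.

```python
from collections import Counter

def base_fractions_per_cycle(reads):
    """For each cycle (read position), return the fraction of A, C, G and T."""
    n_cycles = min(len(r) for r in reads)
    fractions = []
    for cycle in range(n_cycles):
        counts = Counter(read[cycle] for read in reads)
        # Only A/C/G/T are counted; other symbols such as N are ignored.
        total = sum(counts[b] for b in "ACGT")
        fractions.append({b: (counts[b] / total if total else 0.0) for b in "ACGT"})
    return fractions

# Made-up reads, for illustration only.
balanced = ["ACGT", "CGTA", "GTAC", "TACG"]
unbalanced = ["ACGT", "ACTG", "ACGA", "ACCT"]

print(base_fractions_per_cycle(balanced)[0])
print(base_fractions_per_cycle(unbalanced)[0])
```

In the balanced set every base appears in a quarter of the reads at cycle 1, while in the unbalanced set cycle 1 is entirely A.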

Why is nucleotide diversity important?

Nucleotide diversity is required for effective template generation and is important for producing high-quality data.
Diversity matters during the first 4-7 cycles of the first sequencing read on the MiniSeq, MiSeq, NextSeq, and HiSeq 1000-2500 systems. In a process called template generation, the sequencing software uses the images from these early cycles to identify the location of each cluster.
Diversity also matters during the first 25 cycles, because that is when phasing/pre-phasing, color matrix corrections, and pass filter calculations take place.
The Real-Time Analysis (RTA) software requires an appropriate amount of PhiX spike-in. More on this can be found here.

ref)

https://support.illumina.com/bulletins/2016/07/what-is-nucleotide-diversity-and-why-is-it-important.html

Friday, March 29, 2019

[K-MOOC] Data Analytics for Forecasting and Classification: Syllabus

Course: Data Analytics for Forecasting and Classification
Professor: Chi-hyuk Jeon / POSTECH
Goals

Understanding data analysis methods for forecasting and classification based on statistics
Cultivating data analysis skill and application ability by using data analytics methods

Prerequisite

Probability and Statistics, Linear Algebra, Optimization

Schedule

Week 1

1-1. Regression analysis, Simple regression model, Model estimation

[K-MOOC] Data Analytics for Forecasting and Classification: 1-1. Regression analysis, Simple regression model, Model estimation

Regression Analysis

analyzes the statistical cause-and-effect relationships among related variables in order to explain one variable
independent variable: causes
dependent variable: outcomes

Regression Model

Simple Regression Model

𝑿 ⇨ 𝒀
Observations: (𝑿₁,𝒀₁), (𝑿₂,𝒀₂), ... , (𝑿ₙ,𝒀ₙ) (𝑛 is the number of observations)
Simple Regression Model:

𝒀𝑖 = 𝜷₀ + 𝜷₁𝑿𝑖 + ℇ𝑖, 𝑖 = 1,2, ... , 𝑛

ℇ𝑖: error term.

Assume that it follows a normal distribution with mean 0 and variance 𝛔²
ℇ𝑖 ~ N(0, 𝛔²)

𝑿 is not a random variable but a given value
so three parameters need to be estimated:

𝜷₁: slope of the linear equation
𝜷₀: intercept
𝛔²: variance of the error term

Estimation of intercept 𝜷₀ and slope 𝜷₁

Using the least squares method
to minimize the objective function 𝐐
objective function 𝐐

the sum of squared differences between the observed value of the dependent variable 𝒀𝑖 and the fitted value given by the model on the line 𝜷₀ + 𝜷₁𝑿𝑖
𝐐 = ∑(𝒀𝑖 - 𝜷₀ - 𝜷₁𝑿𝑖)²

How to?

(𝑿,𝒀) are observed values, so treat 𝐐 as a function of 𝜷₀ and 𝜷₁
set the partial derivative of 𝐐 with respect to 𝜷₀ to zero:
∂𝐐/∂𝜷₀ = -2∑(𝒀𝑖 - 𝜷₀ - 𝜷₁𝑿𝑖) = 0
set the partial derivative of 𝐐 with respect to 𝜷₁ to zero:
∂𝐐/∂𝜷₁ = -2∑(𝒀𝑖 - 𝜷₀ - 𝜷₁𝑿𝑖)𝑿𝑖 = 0
solving these two equations gives the estimates 𝜷₀-hat and 𝜷₁-hat
estimated equation: 𝒀-hat = 𝜷₀-hat + 𝜷₁-hat * 𝑿
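
Solving the two equations above (the partial derivatives set to zero) simultaneously gives the standard closed-form estimates. The sample means X-bar and Y-bar are notation introduced here for convenience; they do not appear in the lecture notes.

```latex
\begin{aligned}
\sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i) = 0
  &\;\Rightarrow\; \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}, \\
\sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)\, X_i = 0
  &\;\Rightarrow\; \hat{\beta}_1
  = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}
         {\sum_{i=1}^{n} (X_i - \bar{X})^2}.
\end{aligned}
```

The first line says the fitted line always passes through the point (X-bar, Y-bar); substituting it into the second equation yields the slope.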

Estimation of variance of the error term 𝛔²

Using sample variance of the residuals

residual
subtract the estimated value from the observed value of 𝒀
𝒆𝑖 = 𝒀𝑖 - 𝒀𝑖-hat = 𝒀𝑖 - (𝜷₀-hat + 𝜷₁-hat * 𝑿𝑖)
SSE
residual/error sum of squares
= ∑(𝒀𝑖 - 𝒀𝑖-hat)²
estimate 𝛔² by using MSE
𝛔²-hat = MSE (Mean Squared Error) = SSE / (𝑛 - 2)
(𝑛 - 2) is the degrees of freedom, because two parameters (𝜷₀ and 𝜷₁) have already been estimated
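
The whole estimation procedure can be sketched in a few lines of plain Python. The function uses the standard closed-form solution of the two least-squares equations for the slope and intercept, then computes the residuals, SSE, and MSE as defined above. The function name and the five observations are made up for illustration.

```python
def fit_simple_regression(xs, ys):
    """Least-squares estimates for Y = b0 + b1 * X, plus SSE and MSE."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope and intercept from the closed-form least-squares solution.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    # Residuals: e_i = Y_i - (b0 + b1 * X_i).
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    sse = sum(e ** 2 for e in residuals)
    mse = sse / (n - 2)  # n - 2 degrees of freedom
    return b0, b1, sse, mse

# Made-up observations, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b0, b1, sse, mse = fit_simple_regression(xs, ys)
print(b0, b1, sse, mse)
```

For these five points the fit is roughly 𝒀-hat = 2.2 + 0.6 * 𝑿, with SSE of about 2.4 and MSE of about 0.8 (2.4 divided by 5 - 2).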