Introduction to UpTrain

UpTrain is an open-source ML monitoring library.

What is UpTrain? 🤔 - UpTrain Documentation

I wanted to understand how ML monitoring actually works in practice, so I decided to review it.

What are concept drift and data drift?

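Concept drift: the way the data should be interpreted (the relationship between inputs and target) has changed since the model was trained.
Data drift: the statistical distribution of the input data has shifted for some reason. Also called feature drift or covariate shift.
Label drift: the statistical distribution of the ground-truth labels (target variable) has changed.
Prediction drift: the statistical distribution of the predicted values in the deployed environment has changed.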

Key Features 📈 - UpTrain Documentation

Concept Drift - UpTrain Documentation

The algorithm used is the Drift Detection Method (DDM).

DDM - River

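DDM is based on the PAC (Probably Approximately Correct) learning model: as long as the data distribution is stationary, the learner's error rate should decrease as more samples are seen.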

PAC Learning

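$p_{i}$ : error rate measured after $i$ samples, with standard deviation $s_{i} = \sqrt{p_{i}(1 - p_{i}) / i}$
$p_{min}$ : minimum error rate recorded so far
$s_{min}$ : minimum standard deviation recorded so far
Warning level: $p_{min} + 2 * s_{min} \le p_{i} + s_{i}$
Change detected: $p_{min} + 3 * s_{min} \le p_{i} + s_{i}$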
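
To make the two thresholds above concrete, here is a minimal sketch of DDM in plain Python. It is not the UpTrain or River implementation. The class name `SimpleDDM`, the 30-sample warm-up, and the synthetic error stream are all illustrative assumptions. The comparison is written as a strict inequality so that a zero $s_{min}$ does not fire on the very first check.

```python
import math


class SimpleDDM:
    """Illustrative Drift Detection Method sketch (hypothetical helper, not a library API)."""

    def __init__(self, warn_level=2.0, drift_level=3.0, min_samples=30):
        self.warn_level = warn_level
        self.drift_level = drift_level
        self.min_samples = min_samples  # assumed warm-up before any decision
        self.n = 0
        self.errors = 0
        self.p_min = math.inf
        self.s_min = math.inf

    def update(self, error):
        """Feed 1 for a wrong prediction, 0 for a correct one; return the state."""
        self.n += 1
        self.errors += int(error)
        p = self.errors / self.n                # running error rate p_i
        s = math.sqrt(p * (1.0 - p) / self.n)   # its standard deviation s_i
        if self.n < self.min_samples:
            return "stable"
        # Remember the point where p_i + s_i was lowest.
        if p + s <= self.p_min + self.s_min:
            self.p_min, self.s_min = p, s
        if p + s > self.p_min + self.drift_level * self.s_min:
            return "drift"
        if p + s > self.p_min + self.warn_level * self.s_min:
            return "warning"
        return "stable"


# Synthetic stream: 10% errors for 300 samples, then every prediction is wrong.
detector = SimpleDDM()
stream = [1 if i % 10 == 0 else 0 for i in range(1, 301)] + [1] * 50
for i, err in enumerate(stream, start=1):
    state = detector.update(err)
    if state != "stable":
        print(i, state)
    if state == "drift":
        break
```

A production detector would normally reset its statistics after signalling drift; that step is left out here to keep the sketch short.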

import uptrain

# Concept drift check that uses the Drift Detection Method (DDM)
concept_drift_check = {
    'type': uptrain.Anomaly.CONCEPT_DRIFT,
    'algorithm': uptrain.DataDriftAlgo.DDM,
    'warn_thres': 0.5,   # Warning level (defaults to 2)
    'alert_thres': 0.75, # Drift alert threshold (defaults to 3)
}

Example: uptrain_fraud_detection.ipynb - Colaboratory

The NSL-KDD dataset contains logs of various network attacks.

The example is explained using a prediction model built with XGBoost.

Data Drift - UpTrain Documentation

Example: uptrain/run.ipynb at main · uptrain-ai/uptrain

reference_dataset: the baseline data against which drift is measured. This is usually the training data.

Each vector is checked for drift using the following methods.

KL divergence

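A Beginner's Guide to Information Theory - KL divergence made easy

KL divergence compares the probability distributions of the training data and the production data.

Information content: the information in a message of $n$ symbols, each chosen from $s$ possible symbols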

\begin{align} H & = n \log(s)\\ & =\log (s^n) \end{align}

Entropy: uncertainty is greatest when all events are equally likely

\begin{align} H & = \sum (\text{event probability}) \cdot \log_2\left(\frac{1}{\text{event probability}}\right) \\ & = \sum_i p_i \ \log_2\left(\frac{1}{p_i}\right)\\ & = - \sum_i p_i \ \log_2(p_i) \end{align}

Cross entropy: the expected number of questions needed when a particular strategy $q$ is applied to a problem whose true distribution is $p$

\begin{align} H(p,q) = & \sum_i p_i \ \log_2{\frac{1}{q_i}} \\ =&-\sum_i p_i \ \log_2{q_i} \end{align}
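
KL divergence follows directly from the two quantities above: it is the number of extra questions a strategy $q$ costs compared with the best possible strategy, $D_{KL}(p \| q) = H(p,q) - H(p) = \sum_i p_i \log_2 \frac{p_i}{q_i}$. The sketch below computes it for two discrete distributions using only the standard library. The function names are illustrative and this is not UpTrain's implementation.

```python
import math


def entropy(p):
    """H(p) = -sum p_i log2 p_i, skipping zero-probability events."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """H(p, q) = -sum p_i log2 q_i; q_i must be > 0 wherever p_i > 0."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """D_KL(p || q) = H(p, q) - H(p), in bits."""
    return cross_entropy(p, q) - entropy(p)


training = [0.5, 0.5]    # e.g. class shares seen during training
production = [0.9, 0.1]  # shares observed in production

print(entropy(training))                    # 1 bit: a fair coin
print(kl_divergence(training, training))    # identical distributions
print(kl_divergence(training, production))  # grows as the two diverge
```

KL divergence is not symmetric, so swapping the reference and production distributions generally gives a different value.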

Earth mover's distance

Computing EMD (Earth Mover Distance) for t-closeness : Naver Blog

EMD is the amount of work required to transform distribution A into distribution B.
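
For one-dimensional histograms on the same equally spaced bins, this "amount of work" reduces to the summed absolute difference between the two cumulative distributions. A minimal sketch under that assumption follows. The function name `emd_1d` and the `bin_width` parameter are illustrative.

```python
from itertools import accumulate


def emd_1d(p, q, bin_width=1.0):
    """Earth mover's distance between two histograms defined on the same bins."""
    if len(p) != len(q):
        raise ValueError("both histograms must use the same bins")
    p_total, q_total = sum(p), sum(q)
    # Normalise to probabilities and accumulate into CDFs.
    cdf_p = accumulate(x / p_total for x in p)
    cdf_q = accumulate(x / q_total for x in q)
    return bin_width * sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))


print(emd_1d([1, 0, 0], [0, 0, 1]))          # 2.0: all mass moves two bins
print(emd_1d([0.5, 0.5, 0], [0, 0.5, 0.5]))  # 1.0: everything shifts by one bin
```

Unlike KL divergence, EMD takes into account how far the mass has to move, so a small shift between neighbouring bins yields a small distance.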

Population Stability Index

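How the performance metrics are calculated - corporate rating system, credit rating system disclosure

PSI is an index of population stability. It measures how far the current distribution has moved from the distribution at a reference point in time; the larger the value, the more the population has changed.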
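
PSI is usually computed by binning both samples and summing $(a_i - e_i) \ln(a_i / e_i)$ over the bins, where $e_i$ is the share of the reference data in bin $i$ and $a_i$ the share of the current data. The sketch below assumes decile bins taken from the reference sample and a small `eps` to avoid dividing by zero. Both choices, and the function names, are illustrative.

```python
import math
import random
from bisect import bisect_left
from statistics import quantiles


def bin_shares(values, cuts, eps=1e-6):
    """Share of values falling into each bin delimited by the cut points."""
    counts = [0] * (len(cuts) + 1)
    for v in values:
        counts[bisect_left(cuts, v)] += 1
    return [c / len(values) + eps for c in counts]


def psi(reference, current, bins=10):
    """Population Stability Index of `current` relative to `reference`."""
    cuts = quantiles(reference, n=bins, method="inclusive")  # bins - 1 cut points
    expected = bin_shares(reference, cuts)
    actual = bin_shares(current, cuts)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


random.seed(0)
reference = [random.gauss(0, 1) for _ in range(5000)]
shifted = [x + 1 for x in reference]  # same data moved by one standard deviation

print(psi(reference, reference))  # 0.0: nothing changed
print(psi(reference, shifted))    # clearly above zero
```

A commonly quoted rule of thumb treats values below 0.1 as stable and values above 0.25 as a significant shift, but the cut-offs are a convention rather than part of the definition.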

Data Integrity - UpTrain Documentation

This check determines whether the data contains missing values, corrupted values, incorrect data types, or outliers.

data_integrity_checks = [
    {
        # Input feature 'kps' must not be null
        'type': uptrain.Anomaly.DATA_INTEGRITY,
        'measurable_args': {
            'type': uptrain.MeasurableType.INPUT_FEATURE,
            'feature_name': 'kps'
        },
        'integrity_type': 'non_null'
    },
    {
        # Custom signal 'body_length' must be greater than 50
        # (body_length_signal is a user-defined function)
        'type': uptrain.Anomaly.DATA_INTEGRITY,
        'measurable_args': {
            'type': uptrain.MeasurableType.CUSTOM,
            'signal_formulae': uptrain.Signal("body_length", body_length_signal),
        },
        'integrity_type': 'greater_than',
        'threshold': 50
    },
]

Edge-case Detection - UpTrain Documentation

# Use the custom-defined pushup edge-case signal
# (pushup_signal is a user-defined function)
pushup_edge_case = uptrain.Signal("Pushup", pushup_signal)

# Defining the model confidence edge-case signal
# That is, identify model confidence <0.9 as an edge-case
low_conf_edge_case = uptrain.Signal(uptrain.ModelSignal.BINARY_ENTROPY_CONFIDENCE,
                                    is_model_signal=True) < 0.9

# A data point is an edge case if either signal fires
edge_case_check = [{
    'type': uptrain.Anomaly.EDGE_CASE,
    'signal_formulae': (pushup_edge_case | low_conf_edge_case)
}]

pandas - finding outliers: Tukey's Fences, Z-score

Tukey's fences and Z-score are planned for inclusion.
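
As a reminder of what these two outlier rules do, here is a small standard-library sketch. Tukey's fences flag values more than `k` interquartile ranges outside the quartiles, and the Z-score rule flags values more than `threshold` standard deviations from the mean. The defaults `k=1.5` and `threshold=3.0` are the conventional ones. The function names and the sample data are illustrative and say nothing about how UpTrain would implement this.

```python
from statistics import fmean, pstdev, quantiles


def tukey_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def zscore_outliers(values, threshold=3.0):
    """Values whose absolute Z-score exceeds the threshold."""
    mean, std = fmean(values), pstdev(values)
    if std == 0:
        return []  # all values identical, nothing can be an outlier
    return [v for v in values if abs(v - mean) / std > threshold]


data = list(range(10, 30)) + [100]
print(tukey_outliers(data))   # [100]
print(zscore_outliers(data))  # [100]
```

The Z-score rule is sensitive to sample size because the outlier itself inflates the standard deviation; in very small samples no value can reach a Z-score of 3, whereas Tukey's fences rely on quartiles and are less affected.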

Model Bias - UpTrain Documentation

Precision, recall, and F1 score are calculated from user feedback.

Bias is detected by analyzing how the prediction distribution differs between user groups.

UpTrain review
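
The two ideas above can be sketched as follows: treat user feedback as the ground-truth label, compute precision, recall, and F1 per user group, and compare the share of positive predictions between groups. The record layout `(group, y_true, y_pred)`, the group names, and the function names are assumptions made for illustration, not UpTrain's API.

```python
from collections import defaultdict


def precision_recall_f1(y_true, y_pred):
    """Binary precision, recall and F1 (labels are 0 or 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def per_group_report(records):
    """records: iterable of (group, y_true from feedback, y_pred from the model)."""
    grouped = defaultdict(lambda: ([], []))
    for group, y_true, y_pred in records:
        grouped[group][0].append(y_true)
        grouped[group][1].append(y_pred)
    report = {}
    for group, (truths, preds) in grouped.items():
        precision, recall, f1 = precision_recall_f1(truths, preds)
        report[group] = {
            "positive_rate": sum(preds) / len(preds),  # prediction distribution
            "precision": precision,
            "recall": recall,
            "f1": f1,
        }
    return report


records = [
    ("A", 1, 1), ("A", 1, 1), ("A", 0, 0), ("A", 0, 1),
    ("B", 1, 0), ("B", 1, 1), ("B", 0, 0), ("B", 0, 0),
]
for group, metrics in per_group_report(records).items():
    print(group, metrics)
```

A large gap in `positive_rate` or recall between groups is a hint that the model treats them differently and is worth investigating further.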