Summary 3


1. Importing libraries

Import pandas as `pd`, numpy as `np`, seaborn as `sns`, and matplotlib.pyplot as `plt`.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

2. Loading the data

Load the edu_users.csv file to create the edu_df DataFrame.
- Read the file with pandas and store it in `edu_df`.

edu_df = pd.read_csv('edu_users.csv')

Function to remember: pd.read_csv

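3. Checking the DataFrame structure

Review the overall structure of the data (variable types, missing values, and so on) and, based on that information, choose the statement that is **not correct**.
- Print .info() for edu_df.
- Store the number of your answer in the `답03` variable.

1. `rate_math`, `rate_science`, and `tot_time` contain missing values, and all three are floating-point (`float64`).
2. `device`, `gender`, `resub_yn`, `comb_yn`, and `othersub_yn` are categorical variables, but their `dtype` is shown as `object`.
3. There are 13 columns in total, 8 of which are numeric variables.
4. `leave_yn` is a binary variable that indicates whether the user churned, and it is stored as a string (object).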

edu_df.info()
답03 = 4

Don't forget the parentheses: .info()
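The facts that the options ask about (dtypes, missing values, number of numeric columns) can also be checked programmatically instead of being read off the .info() printout. A minimal sketch on a tiny made-up DataFrame; `toy_df` and its values are hypothetical and are not the real edu_users.csv:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for edu_df; the real file has 13 columns.
toy_df = pd.DataFrame({
    "rate_math": [0.2, np.nan, 0.6],   # float column with one missing value
    "tot_time": [30.0, 45.5, np.nan],  # float column with one missing value
    "device": ["pc", "mobile", "pc"],  # text column
    "leave_yn": [0, 1, 0],             # integer column
})

toy_df.info()  # prints dtypes and non-null counts

# Number of missing values per column
missing_counts = toy_df.isna().sum()

# Number of numeric columns (integer and float dtypes)
n_numeric = toy_df.select_dtypes(include="number").shape[1]

print(missing_counts)
print("numeric columns:", n_numeric)
```

On the real data, the same two calls on edu_df give the numbers needed to check options 1 and 3.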

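4. Correlation analysis of viewing rates for major subjects

Analyze the correlations between the viewing rates of the major subjects.

Run the cell below to create the corr_list variable.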

corr_list = ['rate_math', 'rate_science', 'rate_english', 'rate_humanities', 'rate_art']

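Use corr_list to draw a heatmap, interpret it, and choose the most appropriate statement.
Store the answer in the '답04' variable.
annot = True
Print values to two decimal places only.

1) rate_math and rate_science have a strong positive correlation.
2) rate_english and rate_humanities have a negative correlation of at least 0.5 in magnitude.
3) There is almost no correlation between any of the subjects; they tend to be independent.
4) rate_art is generally positively correlated with the other subjects.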

sns.heatmap(data=edu_df[corr_list].corr(),
            annot=True,
            fmt=".2f")
답04 = 3

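.corr(): computes the Pearson correlation coefficient matrix for the selected columns. Each entry is the correlation coefficient of one pair of variables, a value between -1 and 1. (Important)
annot=True: writes the actual correlation coefficient in each cell (one cell per pair of variables).
fmt=".2f": shows two decimal places. The quotation marks are required ⭐️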
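To see what the matrix returned by .corr() looks like before it goes into the heatmap, here is a small sketch on made-up columns. The data are hypothetical and chosen so that the coefficients hit the two extremes:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the real edu_users.csv
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],  # b = 2 * a -> perfectly positive
    "c": [5.0, 4.0, 3.0, 2.0, 1.0],   # c = 6 - a -> perfectly negative
})

corr = toy.corr()  # Pearson is the default method

# The matrix is symmetric, has 1.0 on the diagonal, and stays within [-1, 1].
print(corr.round(2))

# To draw it the same way as in the answer (requires seaborn):
# import seaborn as sns
# sns.heatmap(data=corr, annot=True, fmt=".2f")
```

Values near 0 in the real heatmap would support a statement like option 3, while values near +1 or -1 would support options about strong correlation.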

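5. Visualizing the distribution of study time by churn status

Compare the distribution of study time between the churned group and the retained group.

Draw a boxplot of tot_time for each value of leave_yn and, based on it, choose the correct interpretation from the options below.
1) Retained users (0) have a shorter average study time than churned users (1) and almost no outliers.
2) The median for churned users (1) is higher than for retained users (0), and their overall study time is also longer.
3) The study time distribution of retained users (0) is wider and contains more outliers.
4) The medians of the two groups are similar, but outliers have made the box for churned users extremely tall.
Store the answer in the 답05 variable.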

edu_df.groupby('leave_yn')['tot_time'].mean()
sns.boxplot(data=edu_df, x='leave_yn', y='tot_time')
답05 = 3

edu_df.groupby('leave_yn')['tot_time'].agg(['mean', 'median'])
sns.boxplot(data=edu_df, x='leave_yn', y='tot_time')
답05 = 3
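The options compare means, medians, and outliers, so it can help to put numbers next to the picture. seaborn's boxplot by default draws points farther than 1.5 × IQR from the quartiles as outliers, and that rule is easy to reproduce. A rough sketch on made-up values; the data and the helper `count_outliers` are hypothetical:

```python
import pandas as pd

# Hypothetical toy data, not the real edu_users.csv
toy = pd.DataFrame({
    "leave_yn": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "tot_time": [10, 12, 14, 16, 18, 100, 5, 6, 7, 8, 9, 10],
})

def count_outliers(s: pd.Series) -> int:
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return int(((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum())

# Mean and median per group, plus the outlier count per group
summary = toy.groupby("leave_yn")["tot_time"].agg(["mean", "median"])
summary["n_outliers"] = toy.groupby("leave_yn")["tot_time"].apply(count_outliers)

print(summary)
```

In this toy example a single large value pulls the mean of group 0 far above its median, which is why looking at both 'mean' and 'median' is more informative than the mean alone.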

6. Mean viewing rate of major subjects by churn status

Compare the mean subject viewing rates by churn status.

- Use pivot_table() to compute the means of rate_math and rate_science for each value of leave_yn.

pivot_df = edu_df.pivot_table(
    index='leave_yn',
    values=['rate_math', 'rate_science'],
    aggfunc='mean'
)
pivot_df

1) Watch the parameters: not x, y, hue but index/values/aggfunc.

2) index is the grouping key; values takes a list in square brackets.

3) Every parameter value goes in 'quotes'.
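One way to remember the three parameters is to compare pivot_table() with the equivalent groupby call. A minimal sketch with made-up numbers (the data are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the real edu_users.csv
toy = pd.DataFrame({
    "leave_yn": [0, 0, 1, 1],
    "rate_math": [0.2, 0.4, 0.6, 0.8],
    "rate_science": [0.1, 0.3, 0.5, 0.9],
})

# index = grouping key, values = columns to aggregate, aggfunc = how to aggregate
pivot_df = toy.pivot_table(
    index="leave_yn",
    values=["rate_math", "rate_science"],
    aggfunc="mean",
)

# The same numbers via groupby
grouped = toy.groupby("leave_yn")[["rate_math", "rate_science"]].mean()

print(pivot_df)
print(grouped)
```

Both tables have one row per leave_yn value and one column per subject.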

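7. Handling missing values

Handle missing values appropriately to improve data quality.
- Fill missing values in the rate_math column with the mean.
- Fill missing values in the rate_science column with the median.
- Fill missing values in the tot_time column with the mode.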

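# Assign the result back to the column. In recent pandas versions, fillna(..., inplace=True)
# on a selected column raises a chained-assignment warning and may leave edu_df unchanged.
edu_df['rate_math'] = edu_df['rate_math'].fillna(edu_df['rate_math'].mean())
edu_df['rate_science'] = edu_df['rate_science'].fillna(edu_df['rate_science'].median())
edu_df['tot_time'] = edu_df['tot_time'].fillna(edu_df['tot_time'].mode()[0])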

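Important 1: .mean() / .median() / .mode()[0]. mode() returns a Series because there can be several modes, so [0] picks the first one.

Important 2: use exactly one of the two forms below, never both combined. I'll go with the second. In recent pandas versions the first form can also raise a chained-assignment warning and leave edu_df unchanged, which is another reason to prefer the second.

edu_df['rate_math'].fillna(edu_df['rate_math'].mean(), inplace=True)
edu_df['rate_math'] = edu_df['rate_math'].fillna(edu_df['rate_math'].mean())
edu_df['rate_math'] = edu_df['rate_math'].fillna(edu_df['rate_math'].mean(), inplace=True) is wrong:
Written this way, fillna(..., inplace=True) on the right-hand side returns None, so **None** is assigned to the left-hand side (edu_df['rate_math']). The original data is lost and the entire rate_math column becomes None.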
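The three fill strategies can be tried end to end on a small example. This is only a sketch: the numbers are hypothetical and were chosen so that the mean, median, and mode are easy to verify by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the real edu_users.csv
toy = pd.DataFrame({
    "rate_math": [0.2, np.nan, 0.6, 0.4, 0.8],
    "rate_science": [0.1, 0.3, np.nan, 0.9, 0.5],
    "tot_time": [30.0, 60.0, 60.0, np.nan, 90.0],
})

# mean() and median() return a single number and skip NaN by default.
toy["rate_math"] = toy["rate_math"].fillna(toy["rate_math"].mean())
toy["rate_science"] = toy["rate_science"].fillna(toy["rate_science"].median())

# mode() returns a Series (there can be ties), so take the first entry.
modes = toy["tot_time"].mode()
toy["tot_time"] = toy["tot_time"].fillna(modes[0])

print(toy)
print(toy.isna().sum())
```

Every statement uses the assignment form (column = column.fillna(...)), so nothing depends on inplace behavior.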


8. Splitting into training (train) and evaluation (test) sets

Split the features (X) and the target (y) for model training and evaluation.
- Feature: rate_math, rate_science, rate_english, tot_time
- Target: leave_yn
- Import train_test_split to split the dataset.
- Training set = X_train, y_train
- Validation set = X_valid, y_valid
- random_state=42
- The ratio of training set to validation set is 8:2.

Feature = edu_df[['rate_math', 'rate_science', 'rate_english', 'tot_time']]
Target = edu_df['leave_yn']

from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(Feature, Target, test_size=0.2, random_state=42)

Feature must be a DataFrame, so select its columns with double brackets: edu_df[['rate_math', 'rate_science', 'rate_english', 'tot_time']]!
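The double-bracket point and the 8:2 ratio are both easy to confirm on a small example. A sketch with a made-up 10-row table (hypothetical data, two feature columns instead of four):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with 10 rows, not the real edu_users.csv
toy = pd.DataFrame({
    "rate_math": [i / 10 for i in range(10)],
    "tot_time": [10 * i for i in range(10)],
    "leave_yn": [0, 1] * 5,
})

X = toy[["rate_math", "tot_time"]]  # double brackets -> DataFrame
y = toy["leave_yn"]                 # single brackets -> Series

# test_size=0.2 keeps 80% for training and 20% for validation
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(type(X).__name__, type(y).__name__)
print(X_train.shape, X_valid.shape)
```

Because X and y are split together, the rows of X_train and y_train stay aligned by index.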

9. Feature scaling

Scale the data to improve model training performance.
Run the cell below to import StandardScaler.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

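# Fit the scaler on the training features only, then reuse it for the validation features.
X_train = scaler.fit_transform(X_train)
# The scaled validation features are stored as X_test and used for prediction in step 12.
X_test = scaler.transform(X_valid)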
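What fit_transform and transform do can be reproduced by hand with numpy, which makes it clearer why the validation data only gets transform. A minimal sketch with hypothetical numbers:

```python
import numpy as np

# Hypothetical toy features (rows = samples, columns = features)
X_train_toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_valid_toy = np.array([[2.0, 400.0]])

# "fit": learn the statistics from the training data only
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# "transform": apply the same statistics to both sets
X_train_scaled = (X_train_toy - mean) / std
X_valid_scaled = (X_valid_toy - mean) / std

print(X_train_scaled.mean(axis=0))  # approximately 0 for each column
print(X_train_scaled.std(axis=0))   # approximately 1 for each column
print(X_valid_scaled)
```

The validation row is scaled with the training mean and standard deviation, so no information from the validation set leaks into the preprocessing.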

10. Training a random forest model

Use a random forest model to predict churn.
- Train a model that predicts leave_yn using RandomForestClassifier.
- Store the random forest model in the rfc variable.
- n_estimators=150, max_depth=7, min_samples_split=5, random_state=42
- Train on the scaled data.

from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=150, max_depth=7, min_samples_split=5, random_state=42)
rfc.fit(X_train, y_train)

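Parameter notes:
n_estimators=150
Sets the number of decision trees to build to 150. More trees usually make the model more stable and accurate, but training and prediction can become slower.
max_depth=7
Limits the maximum depth of each tree to 7. If trees grow too deep they become overly complex and can overfit, so a sensible depth limit can improve generalization.
min_samples_split=5
Sets the minimum number of samples required to split a node to 5. In other words, a node is split only if it holds at least 5 samples. Larger values prevent overly fine splits.
random_state=42
The random seed. It fixes the model's randomness (random sampling, tree construction, and so on) so that every run reproduces the same result. It is set mainly for reproducibility.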
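The effect of these parameters can be inspected on a fitted model. The sketch below uses synthetic data from make_classification as a hypothetical stand-in for the scaled features; it is not the exam dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical synthetic data standing in for the scaled features
X_toy, y_toy = make_classification(n_samples=300, n_features=4, random_state=0)

params = dict(n_estimators=150, max_depth=7, min_samples_split=5, random_state=42)
rfc_a = RandomForestClassifier(**params).fit(X_toy, y_toy)
rfc_b = RandomForestClassifier(**params).fit(X_toy, y_toy)

# n_estimators: number of fitted trees
n_trees = len(rfc_a.estimators_)

# max_depth: no tree grows deeper than the limit
deepest = max(tree.get_depth() for tree in rfc_a.estimators_)

# random_state: two models with the same seed make the same predictions
same_predictions = bool((rfc_a.predict(X_toy) == rfc_b.predict(X_toy)).all())

print(n_trees, deepest, same_predictions)
```

Changing random_state in only one of the two models is a quick way to see how much of the result depends on the seed.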

11. Training an XGBoost model

Use XGBoost to predict churn.

- Train a model that predicts leave_yn using XGBClassifier.
- Store the XGBoost model in the xgbc variable.
- n_estimators=200, max_depth=5, learning_rate=0.1, random_state=42
- Train on the scaled data.

from xgboost import XGBClassifier

xgbc = XGBClassifier(n_estimators = 200, max_depth = 5, learning_rate = 0.1, random_state = 42)
xgbc.fit(X_train, y_train)

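12. Evaluating model performance

Compare the accuracy and F1 score of the trained models.

- Predict the validation data with the random forest's predict and store the predictions in the rfc_predict variable.
- Predict the validation data with XGBoost's predict and store the predictions in the xgbc_predict variable.
- Compare the two models using accuracy_score and f1_score, and store the model with the higher accuracy in the 답12 variable (e.g. 답12 = 'RF' or 'XGBC').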

from sklearn.metrics import accuracy_score, f1_score

rfc_predict = rfc.predict(X_test)
xgbc_predict = xgbc.predict(X_test)

print("Random Forest")
print("Accuracy:", accuracy_score(y_valid, rfc_predict))
print("F1 Score:", f1_score(y_valid, rfc_predict))

print("XGBoost")
print("Accuracy:", accuracy_score(y_valid, xgbc_predict))
print("F1 Score:", f1_score(y_valid, xgbc_predict))

답12 = 'XGBC'

Caution: predict with X_test, the scaled validation features from step 9. Calling rfc.predict(X_valid) or xgbc.predict(X_valid) would feed unscaled data to models that were trained on scaled data.
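Accuracy and F1 are simple enough to compute by hand, which makes it easier to see why the two can disagree. A small sketch with made-up labels and predictions (hypothetical values); it follows the usual definitions with class 1 as the positive class, which is also the default of f1_score for binary labels:

```python
import numpy as np

# Hypothetical labels and predictions
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Confusion-matrix counts
tp = int(((y_true == 1) & (y_pred == 1)).sum())
tn = int(((y_true == 0) & (y_pred == 0)).sum())
fp = int(((y_true == 0) & (y_pred == 1)).sum())
fn = int(((y_true == 1) & (y_pred == 0)).sum())

accuracy = (tp + tn) / len(y_true)   # share of all predictions that are right
precision = tp / (tp + fp)           # of predicted 1s, how many are really 1
recall = tp / (tp + fn)              # of real 1s, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(tp, tn, fp, fn)
print(accuracy, f1)
```

Accuracy counts both classes equally, while F1 ignores true negatives, so the two can rank models differently when the classes are imbalanced.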

13. Designing and training a deep learning classification model

Build a deep learning model to predict churn.
Model structure
  - Dense(64, selu) → BatchNormalization
  - Dense(32, selu) → BatchNormalization
  - Dense(16, selu)
  - Dense(1, sigmoid)

Training settings
  - Optimizer: adam
  - Loss: binary_crossentropy
  - Metric: accuracy
  - Epochs: 45
  - Batch size: 32

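from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization

model = Sequential([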
    Dense(64, 'selu'),
    BatchNormalization(),
    Dense(32, 'selu'),
    BatchNormalization(),
    Dense(16, 'selu'),
    Dense(1, 'sigmoid')
])

model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

model.fit(X_train, y_train, epochs=45, batch_size=32)

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization
from tensorflow.keras.callbacks import EarlyStopping
tf.random.set_seed(10)

model = Sequential()
model.add(Dense(64, activation='selu', input_shape=(X_train.shape[1],)))
model.add(BatchNormalization())
model.add(Dense(32, activation='selu'))
model.add(BatchNormalization())
model.add(Dense(16, activation='selu'))
model.add(Dense(1, activation='sigmoid'))

model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

model.fit(
    X_train, y_train,
    epochs=45,
    batch_size=32
)

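* .compile

optimizer='adam'
The Adam (Adaptive Moment Estimation) optimization algorithm
Adapts the learning rate automatically; widely used for its speed and performance

loss='binary_crossentropy'
Loss function for binary classification
The output is a probability, so cross-entropy measures the gap between the true labels and the predictions

metrics=['accuracy']
Uses accuracy as the evaluation metric

* .fit

X_train, y_train: training data (features, labels)

epochs=45
Passes over the entire training data 45 times
Too many epochs risk overfitting; too few leave the model undertrained

batch_size=32
Number of samples used for one parameter update
A common value that balances GPU memory and training stability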
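Two of these settings can be made concrete with a few lines of numpy: what binary cross-entropy actually computes, and how many parameter updates epochs=45 with batch_size=32 implies. This is a sketch only; the helper `binary_crossentropy` is written here for illustration (Keras has its own implementation), and the training-set size of 1000 is a hypothetical number.

```python
import math
import numpy as np

def binary_crossentropy(y_true, p_pred, eps=1e-7):
    """Mean binary cross-entropy; eps keeps log() away from 0."""
    p = np.clip(p_pred, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

y = np.array([1.0, 0.0])
confident_right = binary_crossentropy(y, np.array([0.9, 0.1]))
confident_wrong = binary_crossentropy(y, np.array([0.1, 0.9]))
print(confident_right, confident_wrong)  # the wrong prediction is penalized more

n_samples = 1000  # hypothetical training-set size
batch_size = 32
epochs = 45

# One update per batch; the last, smaller batch still counts as a step.
steps_per_epoch = math.ceil(n_samples / batch_size)
total_updates = steps_per_epoch * epochs
print(steps_per_epoch, total_updates)
```

A smaller batch_size means more updates per epoch, and a larger one means fewer updates that each use more memory.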

14. Simulated prediction with the deep learning model

Use the deep learning model to predict on new data.

simul_data = np.array([[0.3, 0.5, 0.7, 120]])

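- Use simul_data to produce a prediction and store the result in the deep_pre variable.
- Because of the sigmoid, the stored value is a probability.
- Write code that stores 0 if the value is below 0.5 and 1 if it is 0.5 or above, and save that value in the '답14' variable.
- Use numpy's where.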

deep_pre = model.predict(simul_data)

답14 = np.where(deep_pre >= 0.5, 1, 0)

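model.predict(simul_data)
Uses the Keras model already trained with fit() to generate a prediction for simul_data (new input data).
Because the model's output layer is Dense(1, activation='sigmoid'), predict returns probabilities between 0 and 1.

Closer to 0: more likely class 0
Closer to 1: more likely class 1

Storing in deep_pre: the probabilities are saved as they are in the deep_pre variable

np.where(condition, value if true, value if false)
Here:
condition: deep_pre >= 0.5 → class 1 if the predicted probability is 0.5 or above
value if true: 1
value if false: 0

Therefore,
predicted probability ≥ 0.5 → 1
predicted probability < 0.5 → 0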
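The thresholding step works without a trained model, so it can be tested on its own. A minimal sketch where `probs` is a hypothetical array shaped like the output of model.predict, one row per sample and one column:

```python
import numpy as np

# Hypothetical probabilities shaped like model.predict output: (n_samples, 1)
probs = np.array([[0.2], [0.5], [0.93]])

# 1 where the probability is 0.5 or above, otherwise 0
labels = np.where(probs >= 0.5, 1, 0)

print(labels.shape)    # same shape as probs
print(labels.ravel())  # flattened to 1-D for easier reading
```

Note that exactly 0.5 maps to class 1 because the condition uses >=. With a single input row such as simul_data, the result is a 2-D array with one element, which can be read as a plain integer with int(labels[0, 0]).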

Source: a workbook I bought with my own money, [이패스코리아] 2025 이패스 AI능력시험 AICE Associate (listed on www.coupang.com). The original link is a Coupang Partners affiliate link, for which I receive a commission.


문과쿙의 데이터 분석 모험
