[Python] Pandas Basics


These are my notes on the Pandas syntax I keep forgetting while doing machine learning in Python.

Pandas Import

import pandas as pd
df = pd.read_csv("./data/titanic/train.csv")
df.head(3)
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare  Cabin  Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500    NaN         S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833    C85         C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250    NaN         S

Data Type

type(df)
pandas.core.frame.DataFrame
  • The output confirms that read_csv returned a DataFrame.

Data Shape

df.shape
(891, 12)
df.shape[0]  # number of rows
891
df.shape[1]  # number of columns
12

Data info

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
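  1. Shows the total number of columns.
  2. Shows the dtype of each column (int64, float64, object); here object can be thought of as the string type.
  3. The Non-Null Count column shows how many values are missing in each column.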
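
The same facts can also be pulled out programmatically. The sketch below uses a small made-up DataFrame (the `toy` data is hypothetical, not the Titanic file) to read the dtypes and derive missing-value counts from the non-null counts that info() prints.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Titanic file
toy = pd.DataFrame({
    "Name": ["Kim", "Lee", "Park"],
    "Age": [22.0, np.nan, 26.0],
    "Pclass": [3, 1, 3],
})

# Per-column dtypes: the same information as the Dtype column of info()
dtype_names = toy.dtypes.astype(str).to_dict()

# count() gives the non-null count per column,
# so rows minus count() is the number of missing values
null_counts = (len(toy) - toy.count()).to_dict()

print(dtype_names)
print(null_counts)
```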

Data Describe

df.describe()
       PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000    6.000000  512.329200

Value Counts

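  • Typically called on a Series (a single column); it returns how often each distinct value occurs.
  • Handy for checking the distribution of a column.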
df["Survived"].value_counts()
0    549
1    342
Name: Survived, dtype: int64
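
Two optional parameters of value_counts are worth remembering: normalize=True returns proportions instead of counts, and dropna=False also counts missing values. A minimal sketch on made-up Series (the values are hypothetical, not taken from the Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical label column
survived = pd.Series([0, 1, 0, 0, 1], name="Survived")

counts = survived.value_counts()                # absolute counts, most frequent first
ratios = survived.value_counts(normalize=True)  # share of each value

# By default NaN is ignored; dropna=False gives it its own row
embarked = pd.Series(["S", "C", np.nan, "S"])
with_nan = embarked.value_counts(dropna=False)

print(counts)
print(ratios)
print(with_nan)
```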

Converting between DataFrame, list, and ndarray

List, ndarray, dict -> DataFrame

import numpy as np

col_name1 = ["col"]
list1 = [1,2,3]
array1 = np.array(list1)
print("array1 shape:", array1.shape)

# list -> DataFrame
df_list1 = pd.DataFrame(list1, columns=col_name1)
print("DataFrame from a 1D list:\n", df_list1)

# ndarray -> DataFrame
df_array1 = pd.DataFrame(array1, columns=col_name1)
print("DataFrame from a 1D ndarray:\n", df_array1)

# dictionary -> DataFrame (keys become column names)
# Named data_dict so that the built-in dict is not shadowed
data_dict = {"col1": [1, 2], "col2": [3, 4], "col3": [5, 6]}
df_dict = pd.DataFrame(data_dict)
print("DataFrame from a dictionary:\n", df_dict)
array1 shape: (3,)
DataFrame from a 1D list:
    col
0    1
1    2
2    3
DataFrame from a 1D ndarray:
    col
0    1
1    2
2    3
DataFrame from a dictionary:
    col1  col2  col3
0     1     3     5
1     2     4     6
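
The examples above only use 1D inputs. As a small extension, here is a sketch of the 2D case, where each inner list (or each row of the ndarray) becomes one row of the DataFrame. The names `list2`, `array2`, and `col_names` are made up for illustration.

```python
import numpy as np
import pandas as pd

col_names = ["col1", "col2", "col3"]

# 2 rows x 3 columns: each inner list is one row
list2 = [[1, 2, 3], [11, 12, 13]]
array2 = np.array(list2)

# The number of column names must match the number of columns in the data
df_list2 = pd.DataFrame(list2, columns=col_names)
df_array2 = pd.DataFrame(array2, columns=col_names)

print(array2.shape)
print(df_list2)
```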

DataFrame -> List, ndarray, dict

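Machine learning packages often expect their input as NumPy ndarrays, so it is worth knowing how to convert a DataFrame back into an array.

# DataFrame -> ndarray
# Use the .values attribute (it is an attribute, not a function)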

array3 = df_dict.values
print(array3)

# DataFrame -> List
list3 = df_dict.values.tolist()
print(list3)

# DataFrame -> Dictionary
dict3 = df_dict.to_dict("list")
print(dict3)
[[1 3 5]
 [2 4 6]]
[[1, 3, 5], [2, 4, 6]]
{'col1': [1, 2], 'col2': [3, 4], 'col3': [5, 6]}
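
As a side note, the pandas documentation recommends to_numpy() over .values for new code. The sketch below rebuilds the same small DataFrame under the hypothetical name `df_small` and also shows the "records" orientation of to_dict, which yields one dictionary per row.

```python
import pandas as pd

df_small = pd.DataFrame({"col1": [1, 2], "col2": [3, 4], "col3": [5, 6]})

# DataFrame -> ndarray; same result as .values for this data
arr = df_small.to_numpy()

# DataFrame -> list of row dictionaries
records = df_small.to_dict("records")

# Mixing numbers and strings forces a generic object array
mixed = pd.DataFrame({"a": [1], "b": ["x"]}).to_numpy()

print(arr)
print(records)
print(mixed.dtype)
```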

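Deleting data from a DataFrame

  • Use the drop method.
  • axis=0 drops rows by index label; axis=1 drops columns.
  • The default is axis=0, so always pass axis explicitly to make the intent clear.
  • drop returns a new DataFrame and leaves the original untouched; pass inplace=True to modify the original instead.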
df.drop("Pclass", axis=1)
     PassengerId                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare  Cabin  Embarked
0              1                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500    NaN         S
1              2  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833    C85         C
2              3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250    NaN         S
3              4       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000   C123         S
4              5                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500    NaN         S
..           ...                                                ...     ...   ...    ...    ...               ...      ...    ...       ...
886          887                              Montvila, Rev. Juozas    male  27.0      0      0            211536  13.0000    NaN         S
887          888                       Graham, Miss. Margaret Edith  female  19.0      0      0            112053  30.0000    B42         S
888          889           Johnston, Miss. Catherine Helen "Carrie"  female   NaN      1      2        W./C. 6607  23.4500    NaN         S
889          890                              Behr, Mr. Karl Howell    male  26.0      0      0            111369  30.0000   C148         C
890          891                                Dooley, Mr. Patrick    male  32.0      0      0            370376   7.7500    NaN         Q

891 rows × 10 columns

df.drop("Pclass", axis=1, inplace=True)
# The Pclass column is now removed from df itself
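
To make the difference between the two calls concrete, here is a minimal sketch on a made-up three-column DataFrame (`toy` and its columns are hypothetical). It also shows dropping rows with axis=0 and the `columns=` keyword, which is an alternative spelling of axis=1.

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

# Without inplace, drop returns a new DataFrame and toy is unchanged
dropped_col = toy.drop("b", axis=1)

# axis=0 drops rows by index label
dropped_rows = toy.drop([0, 2], axis=0)

# columns=... is equivalent to passing axis=1
same_as_axis1 = toy.drop(columns=["b"])

columns_before = list(toy.columns)

# With inplace=True, toy itself is modified
toy.drop("c", axis=1, inplace=True)
columns_after = list(toy.columns)

print(columns_before, columns_after)
```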

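Index object

  • Plays a role similar to a primary key in a database: it identifies each row (although pandas does not require the values to be unique).
  • Supports slicing with the usual [start:stop] notation.
  • An Index is immutable: individual values cannot be changed in place.
# Reload the original data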
df = pd.read_csv("./data/titanic/train.csv")
df.head(3)
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare  Cabin  Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500    NaN         S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833    C85         C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250    NaN         S
# Extract the Index object
indexes = df.index
print(indexes)

# Convert to an ndarray
# indexes.values

# Convert to a list
# indexes.values.tolist()
RangeIndex(start=0, stop=891, step=1)
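
The slicing and immutability points can be checked on a tiny example. This is only a sketch on a made-up DataFrame (`toy` is hypothetical): assigning to a single position of an Index raises TypeError, while replacing the whole index at once is allowed.

```python
import pandas as pd

toy = pd.DataFrame({"x": [10, 20, 30, 40]})
idx = toy.index

# Slicing an Index returns another Index
first_two = idx[:2]
as_list = idx.tolist()

# Item assignment is rejected because an Index is immutable
try:
    idx[0] = 99
    mutated = True
except TypeError:
    mutated = False

# To change the labels, assign a complete new index instead
toy.index = ["a", "b", "c", "d"]

print(list(first_two), as_list, mutated)
print(toy.loc["c", "x"])
```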

Reset Index

df.reset_index().head(3)
   index  PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare  Cabin  Embarked
0      0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500    NaN         S
1      1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833    C85         C
2      2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250    NaN         S
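
reset_index is most useful after filtering, when the surviving rows keep their old, non-consecutive labels. A minimal sketch on made-up data (`toy` is hypothetical): by default the old labels are kept in a new "index" column, and drop=True discards them.

```python
import pandas as pd

toy = pd.DataFrame({"Age": [22.0, 38.0, 26.0, 35.0]})

# Filtering keeps the original labels (here 1 and 3)
older = toy[toy["Age"] > 30]

# Old labels move into a column named "index"; rows are renumbered from 0
with_old = older.reset_index()

# drop=True renumbers the rows without keeping the old labels
renumbered = older.reset_index(drop=True)

print(with_old)
print(renumbered)
```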

Handling missing data

  • The two methods to remember are isna and fillna.
# Count the missing values in each column
df.isna().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
# Replace missing values with fillna
df["Cabin"] = df["Cabin"].fillna("C000")
df.head(3)
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare  Cabin  Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   C000         S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833    C85         C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   C000         S
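
A constant such as "C000" is only one option. As a sketch of two other common choices, the example below fills a numeric column with its mean and a categorical column with its most frequent value. The data is made up, and whether these strategies suit Age and Embarked in the real dataset is an assumption, not something the notes above establish.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap in each column
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Embarked": ["S", None, "S"],
})

# Numeric column: fill with the mean of the observed values
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())

# Categorical column: fill with the most frequent value (the mode)
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

# Total number of missing values left in the whole DataFrame
remaining = int(toy.isna().sum().sum())

print(toy)
print(remaining)
```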

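Reshaping data

  • Use reshape to change the shape of an array.
  • Commonly needed before applying a scaler: scikit-learn scalers expect 2D input of shape (n_samples, n_features), so a single column is reshaped with reshape(-1, 1).
from sklearn.preprocessing import StandardScaler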
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
test = scaler.fit_transform(df["Fare"].values.reshape(-1,1))
pd.DataFrame(test, columns=["test_column"]).head(3)
   test_column
0    -0.502445
1     0.786845
2    -0.488854
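
To see what reshape(-1, 1) actually does, the sketch below compares array shapes using NumPy only, on three made-up fare values. The manual standardization at the end is an illustration of the same calculation StandardScaler performs (mean 0, population standard deviation 1), not a replacement for it.

```python
import numpy as np
import pandas as pd

# Hypothetical fare values
fare = pd.Series([7.25, 71.2833, 7.925], name="Fare")

one_d = fare.values                      # shape (3,): a flat vector
two_d = fare.values.reshape(-1, 1)       # shape (3, 1): -1 means "infer this dimension"
also_two_d = fare.to_frame().to_numpy()  # a one-column DataFrame is already 2D

# Manual standardization per column; np.std defaults to the population formula (ddof=0)
scaled = (two_d - two_d.mean(axis=0)) / two_d.std(axis=0)

print(one_d.shape, two_d.shape, also_two_d.shape)
print(scaled.ravel())
```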

Data Inserting

fake_age = df["Age"]*4
fake_age
df.insert(1, "fake_age",fake_age)
# df.drop("fake_age", axis=1, inplace=True)
df.head(3)
   PassengerId  fake_age  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare  Cabin  Embarked
0            1      88.0         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   C000         S
1            2     152.0         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833    C85         C
2            3     104.0         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   C000         S
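
A few properties of insert are easy to forget: it modifies the DataFrame in place, the first argument is the zero-based column position, and inserting a name that already exists raises ValueError by default. A minimal sketch on made-up data (`toy` and `age_months` are hypothetical), with plain column assignment shown for comparison:

```python
import pandas as pd

toy = pd.DataFrame({"PassengerId": [1, 2], "Age": [22.0, 38.0]})

# Insert at position 1, i.e. right after the first column; toy is changed in place
toy.insert(1, "fake_age", toy["Age"] * 4)

# Inserting the same column name again is rejected
try:
    toy.insert(0, "fake_age", 0)
    duplicate_allowed = True
except ValueError:
    duplicate_allowed = False

# Plain assignment always appends the new column at the end
toy["age_months"] = toy["Age"] * 12

print(list(toy.columns))
print(duplicate_allowed)
```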



© 2021.04. by Jessie