I keep forgetting Pandas syntax while doing machine learning in Python, so I wrote these notes.

Pandas Import

import pandas as pd

df = pd.read_csv("./data/titanic/train.csv")
df.head(3)

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S

Data Type

type(df)

pandas.core.frame.DataFrame

The output shows that the object is a DataFrame.

Data Shape

df.shape

(891, 12)

df.shape[0]  # number of rows

df.shape[1]  # number of columns

Data info

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

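Shows the total number of columns.
Shows each column's dtype (int64, float64, object); here object can be thought of as a string type.
Shows the non-null count per column, so the number of nulls can be worked out.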
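The facts that `df.info()` prints can also be pulled out as values. A minimal sketch, assuming a small hand-made DataFrame (`sample` and its values are made up for illustration and are not taken from the Titanic file):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a few Titanic-like columns
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Sex": ["male", "female", "female"],
    "Pclass": [3, 1, 3],
})

n_columns = sample.shape[1]   # total number of columns
dtypes = sample.dtypes        # dtype of each column
non_null = sample.count()     # non-null count per column, like info()

print(n_columns)
print(dtypes)
print(non_null)
```

Subtracting `non_null` from `len(sample)` gives the null count per column.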

Data Describe

df.describe()

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	446.000000	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	257.353842	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	223.500000	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	446.000000	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	668.500000	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

Value Counts

Call it on a Series (a single column) to get the count of each distinct value.
Well suited to checking a distribution.

df["Survived"].value_counts()

0    549
1    342
Name: Survived, dtype: int64
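Two keyword arguments of `value_counts` are handy when checking a distribution. A minimal sketch, assuming a made-up `survived` Series (not the real Titanic column) with one missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical toy labels with one missing entry
survived = pd.Series([0, 1, 0, 0, np.nan], name="Survived")

counts = survived.value_counts()                # NaN is excluded by default
ratios = survived.value_counts(normalize=True)  # relative frequencies instead of counts
with_nan = survived.value_counts(dropna=False)  # missing values are counted too

print(counts)
print(ratios)
print(with_nan)
```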

Converting Between DataFrame, List, and ndarray

List, ndarray, dict -> DataFrame

import numpy as np

col_name1 = ["col"]
list1 = [1,2,3]
array1 = np.array(list1)
print("array1 shape:", array1.shape)

# list -> DataFrame
df_list1 = pd.DataFrame(list1, columns=col_name1)
print("DataFrame from 1-D list:\n", df_list1)

# ndarray -> DataFrame
df_array1 = pd.DataFrame(array1, columns=col_name1)
print("DataFrame from 1-D ndarray:\n", df_array1)

# dictionary -> DataFrame (named dict1 to avoid shadowing the built-in dict)
dict1 = {"col1": [1, 2], "col2": [3, 4], "col3": [5, 6]}
df_dict = pd.DataFrame(dict1)
print("DataFrame from dictionary:\n", df_dict)

array1 shape: (3,)
DataFrame from 1-D list:
    col
0    1
1    2
2    3
DataFrame from 1-D ndarray:
    col
0    1
1    2
2    3
DataFrame from dictionary:
    col1  col2  col3
0     1     3     5
1     2     4     6

DataFrame -> List, ndarray, dict

Machine learning packages often expect their input as a NumPy array, so it is worth knowing how to convert a DataFrame back into an array.

# DataFrame -> Array
# via the values attribute

array3 = df_dict.values
print(array3)

# DataFrame -> List
list3 = df_dict.values.tolist()
print(list3)

# DataFrame -> Dictionary
dict3 = df_dict.to_dict("list")
print(dict3)

[[1 3 5]
 [2 4 6]]
[[1, 3, 5], [2, 4, 6]]
{'col1': [1, 2], 'col2': [3, 4], 'col3': [5, 6]}
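A few related conversions, sketched under the assumption that the same `df_dict` as above is rebuilt from scratch. `to_numpy()` is the method form of `.values`, and `to_dict` accepts other orientations besides `"list"`:

```python
import pandas as pd

df_dict = pd.DataFrame({"col1": [1, 2], "col2": [3, 4], "col3": [5, 6]})

array4 = df_dict.to_numpy()           # same 2-D contents as df_dict.values
records = df_dict.to_dict("records")  # one dict per row
by_column = df_dict.to_dict()         # default orientation: {column: {index: value}}

print(array4.shape)
print(records)
print(by_column)
```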

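Deleting Data from a DataFrame

Use the drop function.
axis=0 drops rows; axis=1 drops columns.
Always pass axis explicitly (the default is axis=0, which looks for the label among the rows).
To apply the deletion to the original DataFrame, pass inplace=True.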

df.drop("Pclass", axis=1)

	PassengerId	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
...	...	...	...	...	...	...	...	...	...	...
886	887	Montvila, Rev. Juozas	male	27.0	0	0	211536	13.0000	NaN	S
887	888	Graham, Miss. Margaret Edith	female	19.0	0	0	112053	30.0000	B42	S
888	889	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.4500	NaN	S
889	890	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.0000	C148	C
890	891	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.7500	NaN	Q

891 rows × 10 columns

df.drop("Pclass", axis=1, inplace=True)
# The Pclass column is now removed from df itself
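The difference between a plain `drop` and one that changes the original is easy to check on a small example. A minimal sketch, assuming a made-up `toy` DataFrame; the `columns=` keyword shown last is an alternative spelling that does not need `axis`:

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

without_b = toy.drop("b", axis=1)         # returns a new DataFrame; toy still has "b"
without_rows = toy.drop([0, 2], axis=0)   # drops rows by index label
same_as_b = toy.drop(columns=["b"])       # keyword form, equivalent to axis=1

print(list(toy.columns))
print(list(without_b.columns))
print(list(without_rows.index))
```

Reassigning the result (`toy = toy.drop("b", axis=1)`) has the same effect on the name `toy` as `inplace=True`.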

Index Object

Plays a role similar to a primary key (PK) in a database.
Supports slicing with [:].
Individual Index values cannot be modified in place.

# Reload the original data
df = pd.read_csv("./data/titanic/train.csv")
df.head(3)

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S

# Extract the Index object
indexes = df.index
print(indexes)

# Convert to an array
# indexes.values

# Convert to a list
# indexes.values.tolist()

RangeIndex(start=0, stop=891, step=1)
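The slicing and immutability points can be checked directly. A minimal sketch, assuming a made-up `toy` DataFrame; the flag `mutated` is only there to record what happened:

```python
import pandas as pd

toy = pd.DataFrame({"x": [10, 20, 30]})
idx = toy.index

first_two = idx[:2]          # slicing an Index works like slicing a list

try:
    idx[0] = 5               # item assignment on an Index is not supported
    mutated = True
except TypeError:
    mutated = False

toy.index = ["a", "b", "c"]  # replacing the whole Index at once is allowed

print(list(first_two))
print(mutated)
print(list(toy.index))
```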

Reset Index

df.reset_index().head(3)

	index	PassengerId	fake_age	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	0	1	88.0	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	C000	S
1	1	2	152.0	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	2	3	104.0	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	C000	S
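By default `reset_index` keeps the old index as a new column named `index`; passing `drop=True` discards it instead. A minimal sketch, assuming a made-up `toy` DataFrame with a non-sequential index:

```python
import pandas as pd

toy = pd.DataFrame({"x": [10, 20, 30]}, index=[5, 7, 9])

kept = toy.reset_index()              # old index becomes a column named "index"
dropped = toy.reset_index(drop=True)  # old index is thrown away

print(kept)
print(dropped)
```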

Handling Missing Data

Remember the isna and fillna functions.

# Check for missing (null) values
df.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

# Replace missing values with fillna
df["Cabin"] = df["Cabin"].fillna("C000")
df.head(3)

	PassengerId	fake_age	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	88.0	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	C000	S
1	2	152.0	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	104.0	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	C000	S
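Besides a fixed label, a numeric column can be filled with a statistic such as its mean. A minimal sketch, assuming a made-up `toy` DataFrame; the placeholder `"N"` is an arbitrary choice for illustration:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Age": [20.0, np.nan, 40.0], "Embarked": ["S", None, "C"]})

toy["Age"] = toy["Age"].fillna(toy["Age"].mean())  # mean of 20 and 40 is 30
toy["Embarked"] = toy["Embarked"].fillna("N")      # arbitrary placeholder label

remaining = toy.isna().sum().sum()                 # total missing values left

print(toy)
print(remaining)
```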

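Data Reshaping

Use reshape to change the shape of the data.
This is needed when applying a scaler: scikit-learn scalers expect a 2-D array, so a 1-D column is turned into a single-column 2-D array with reshape(-1, 1).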

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
test = scaler.fit_transform(df["Fare"].values.reshape(-1,1))
pd.DataFrame(test, columns=["test_column"]).head(3)

	test_column
0	-0.502445
1	0.786845
2	-0.488854
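What `reshape(-1, 1)` does to the shape is easiest to see on a tiny array. A minimal sketch using only NumPy and pandas, assuming a made-up `toy` DataFrame; the hand-written standardization at the end uses the population standard deviation, which, as far as I know, is what `StandardScaler` also uses:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Fare": [10.0, 20.0, 30.0]})

flat = toy["Fare"].values        # 1-D array, shape (3,)
column = flat.reshape(-1, 1)     # 2-D array, shape (3, 1); -1 lets NumPy infer the row count
also_2d = toy[["Fare"]].values   # double brackets select a DataFrame, so this is already 2-D

# Standardize by hand: subtract the mean, divide by the population std (ddof=0)
scaled = (column - column.mean()) / column.std()

print(flat.shape, column.shape, also_2d.shape)
print(scaled)
```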

Data Inserting

fake_age = df["Age"]*4
fake_age
df.insert(1, "fake_age",fake_age)
# df.drop("fake_age", axis=1, inplace=True)
df.head(3)

	PassengerId	fake_age	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	88.0	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	C000	S
1	2	152.0	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	104.0	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	C000	S
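`insert` places a column at a given position, while plain assignment appends it at the end. A minimal sketch, assuming a made-up `toy` DataFrame; the column names `double_age` and `half_age` are hypothetical:

```python
import pandas as pd

toy = pd.DataFrame({"id": [1, 2], "Age": [22.0, 38.0]})

toy.insert(1, "double_age", toy["Age"] * 2)  # placed at column position 1
toy["half_age"] = toy["Age"] / 2             # plain assignment appends at the end

try:
    toy.insert(0, "double_age", 0)           # inserting an existing name fails
    duplicate_allowed = True
except ValueError:
    duplicate_allowed = False

print(list(toy.columns))
print(duplicate_allowed)
```

This is also why rerunning a notebook cell that calls `insert` raises an error unless the column is dropped first.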

제씨의 블로그