Add one row to a pandas DataFrame


optionbox 2020. 9. 30. 10:24


I understand that pandas is designed to load a fully populated DataFrame, but I need to create an empty DataFrame and then add rows one by one. What is the best way to do this?

I successfully created an empty DataFrame with:

res = DataFrame(columns=('lib', 'qty1', 'qty2'))

Then I can add a new row and fill a field with:

res = res.set_value(len(res), 'qty1', 10.0)

It works, but it seems very odd :-/ (and it fails when adding a string value).

How can I add a new row to a DataFrame whose columns have different types?

>>> import pandas as pd
>>> from numpy.random import randint

>>> df = pd.DataFrame(columns=['lib', 'qty1', 'qty2'])
>>> for i in range(5):
...     df.loc[i] = ['name' + str(i)] + list(randint(10, size=2))

>>> df
     lib qty1 qty2
0  name0    3    3
1  name1    2    4
2  name2    2    8
3  name3    2    1
4  name4    9    6

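If you can get all the data for the data frame up front, there is a much faster approach than appending to a data frame:

1. Create a list of dictionaries in which each dictionary corresponds to one input data row.
2. Create a data frame from this list.

I had a similar task where appending to a data frame row by row took 30 minutes, while creating the data frame from a list of dictionaries completed within seconds.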

rows_list = []
for row in input_rows:
    dict1 = {}
    # get input row in dictionary format
    # key = col_name
    dict1.update(blah..)

    rows_list.append(dict1)

df = pd.DataFrame(rows_list)
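
A minimal runnable sketch of the two steps above. The `input_rows` data and the column names are made up for illustration; only the pattern of collecting dictionaries and calling `pd.DataFrame` once comes from the answer.

```python
import pandas as pd

# Hypothetical input: each tuple stands in for one raw input row.
input_rows = [("name0", 3, 3), ("name1", 2, 4), ("name2", 2, 8)]

rows_list = []
for lib, qty1, qty2 in input_rows:
    # Build one dictionary per row, keyed by column name.
    rows_list.append({"lib": lib, "qty1": qty1, "qty2": qty2})

# Construct the DataFrame once, after all rows are collected.
df = pd.DataFrame(rows_list)
print(df)
```

Because the DataFrame is built in a single call, pandas can infer each column's dtype from the complete data instead of re-deriving it on every appended row.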

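You can use pandas.concat() or DataFrame.append(). For details and examples, see Merge, join, and concatenate. Note that DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions only pandas.concat() applies.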
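
As a rough illustration of the pandas.concat() route, the sketch below adds several rows in one call. The data and the `pieces` list are invented for the example and are not part of the answer.

```python
import pandas as pd

df = pd.DataFrame({"lib": ["name0"], "qty1": [3], "qty2": [3]})

# Hypothetical new rows, each held as a small one-row DataFrame.
pieces = [
    pd.DataFrame({"lib": ["name1"], "qty1": [2], "qty2": [4]}),
    pd.DataFrame({"lib": ["name2"], "qty1": [2], "qty2": [8]}),
]

# One concat call for all pieces; ignore_index=True renumbers rows 0..n-1.
df = pd.concat([df, *pieces], ignore_index=True)
print(df)
```

Each concat call returns a new DataFrame, so gathering the pieces first and concatenating once avoids copying the accumulated data on every iteration.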

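It has been a long time, but I ran into the same problem too, and I found a lot of interesting answers here. That left me unsure which method to use.

When adding a lot of rows to a data frame, what I care about is speed. So I tried the four most popular methods and measured each one.

Updated in 2019 using new versions of the packages. Also updated after @FooBar's comment.

Speed performance

1. Using .append (NPE's answer)
2. Using .loc (fred's answer)
3. Using .loc with preallocation (FooBar's answer)
4. Using dict and creating the DataFrame at the end (ShikharDua's answer)

Results (seconds):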

|------------|-------------|-------------|-------------|
|  Approach  |  1000 rows  |  5000 rows  | 10 000 rows |
|------------|-------------|-------------|-------------|
| .append    |    0.69     |    3.39     |    6.78     |
|------------|-------------|-------------|-------------|
| .loc w/o   |    0.74     |    3.90     |    8.35     |
| prealloc   |             |             |             |
|------------|-------------|-------------|-------------|
| .loc with  |    0.24     |    2.58     |    8.70     |
| prealloc   |             |             |             |
|------------|-------------|-------------|-------------|
|  dict      |    0.012    |   0.046     |   0.084     |
|------------|-------------|-------------|-------------|

Thanks also to @krassowski for the useful comment; I updated the code.

So I use the dictionary approach myself.

Code:

import pandas as pd
import numpy as np
import time

numOfRows = 1000
# append (needs pandas < 2.0, where DataFrame.append still exists)
startTime = time.perf_counter()
df1 = pd.DataFrame(np.random.randint(100, size=(5,5)), columns=['A', 'B', 'C', 'D', 'E'])
for i in range( 1,numOfRows-4):
    df1 = df1.append( dict( (a,np.random.randint(100)) for a in ['A','B','C','D','E']), ignore_index=True)
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df1.shape)

# .loc w/o prealloc
startTime = time.perf_counter()
df2 = pd.DataFrame(np.random.randint(100, size=(5,5)), columns=['A', 'B', 'C', 'D', 'E'])
for i in range( 1,numOfRows):
    df2.loc[i]  = np.random.randint(100, size=(1,5))[0]
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df2.shape)

# .loc with prealloc
df3 = pd.DataFrame(index=np.arange(0, numOfRows), columns=['A', 'B', 'C', 'D', 'E'] )
startTime = time.perf_counter()
for i in range( 1,numOfRows):
    df3.loc[i]  = np.random.randint(100, size=(1,5))[0]
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df3.shape)

# dict
startTime = time.perf_counter()
row_list = []
for i in range (0,5):
    row_list.append(dict( (a,np.random.randint(100)) for a in ['A','B','C','D','E']))
for i in range( 1,numOfRows-4):
    dict1 = dict( (a,np.random.randint(100)) for a in ['A','B','C','D','E'])
    row_list.append(dict1)

df4 = pd.DataFrame(row_list, columns=['A','B','C','D','E'])
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df4.shape)

P.S. I believe my implementation is not perfect, and there may be room for optimization.
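
For readers on pandas 2.0 or later, where DataFrame.append no longer exists, a reduced version of the comparison can be sketched as follows. This is not the answer's benchmark: the function names, the helper `time_it`, and the small row count are assumptions chosen so the sketch runs quickly, and the timings it prints will differ from the table above.

```python
import time

import numpy as np
import pandas as pd

COLUMNS = ["A", "B", "C", "D", "E"]


def build_with_loc(n_rows):
    """Grow the DataFrame one row at a time through .loc."""
    df = pd.DataFrame(columns=COLUMNS)
    for i in range(n_rows):
        df.loc[i] = np.random.randint(100, size=5)
    return df


def build_with_dicts(n_rows):
    """Collect rows as dictionaries, then build the DataFrame once."""
    rows = [
        {col: np.random.randint(100) for col in COLUMNS}
        for _ in range(n_rows)
    ]
    return pd.DataFrame(rows, columns=COLUMNS)


def time_it(func, n_rows):
    """Return (elapsed seconds, resulting DataFrame) for one call."""
    start = time.perf_counter()
    result = func(n_rows)
    return time.perf_counter() - start, result


n = 200  # kept small so the sketch runs quickly
for func in (build_with_loc, build_with_dicts):
    elapsed, df = time_it(func, n)
    print(f"{func.__name__}: {elapsed:.3f} s, shape={df.shape}")
```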

If you know the number of entries in advance, you should preallocate the space by also providing the index (using the data example from a different answer):

import pandas as pd
import numpy as np
# we know we're gonna have 5 rows of data
numberOfRows = 5
# create dataframe
df = pd.DataFrame(index=np.arange(0, numberOfRows), columns=('lib', 'qty1', 'qty2') )

# now fill it up row by row
for x in np.arange(0, numberOfRows):
    #loc or iloc both work here since the index is natural numbers
    df.loc[x] = [np.random.randint(-1,1) for n in range(3)]
In[23]: df
Out[23]: 
   lib  qty1  qty2
0   -1    -1    -1
1    0     0     0
2   -1     0    -1
3    0    -1     0
4   -1     0     0

Speed comparison

In[30]: %timeit tryThis() # function wrapper for this answer
In[31]: %timeit tryOther() # function wrapper without index (see, for example, @fred)
1000 loops, best of 3: 1.23 ms per loop
100 loops, best of 3: 2.31 ms per loop

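And, as noted in the comments, with a size of 6000 the speed difference becomes even larger.

Increasing the size of the array (12) and the number of rows (500) makes the speed difference more striking: 313ms vs 2.29s

For efficient appending, see How to add an extra row to a pandas dataframe and Setting With Enlargement.

Add rows through loc/ix on a key that does not yet exist in the index (ix was removed in pandas 1.0, so use loc on current versions). For example: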

In [1]: se = pd.Series([1,2,3])

In [2]: se
Out[2]: 
0    1
1    2
2    3
dtype: int64

In [3]: se[5] = 5.

In [4]: se
Out[4]: 
0    1.0
1    2.0
2    3.0
5    5.0
dtype: float64

Or:

In [1]: dfi = pd.DataFrame(np.arange(6).reshape(3,2),
   .....:                 columns=['A','B'])
   .....: 

In [2]: dfi
Out[2]: 
   A  B
0  0  1
1  2  3
2  4  5

In [3]: dfi.loc[:,'C'] = dfi.loc[:,'A']

In [4]: dfi
Out[4]: 
   A  B  C
0  0  1  0
1  2  3  2
2  4  5  4
In [5]: dfi.loc[3] = 5

In [6]: dfi
Out[6]: 
   A  B  C
0  0  1  0
1  2  3  2
2  4  5  4
3  5  5  5

import pandas as pd

mycolumns = ['A', 'B']
df = pd.DataFrame(columns=mycolumns)
rows = [[1,2],[3,4],[5,6]]
for row in rows:
    df.loc[len(df)] = row

You can append a single row as a dictionary by using the ignore_index option.

>>> f = pandas.DataFrame(data = {'Animal':['cow','horse'], 'Color':['blue', 'red']})
>>> f
  Animal Color
0    cow  blue
1  horse   red
>>> f.append({'Animal':'mouse', 'Color':'black'}, ignore_index=True)
  Animal  Color
0    cow   blue
1  horse    red
2  mouse  black
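
On pandas 2.0 and later, where DataFrame.append is no longer available, the same single-dictionary append can be sketched with pandas.concat. This is a substitute technique rather than the answer's own code; the variable `row` is introduced only for the example.

```python
import pandas as pd

f = pd.DataFrame({"Animal": ["cow", "horse"], "Color": ["blue", "red"]})

row = {"Animal": "mouse", "Color": "black"}

# A one-element list of dicts becomes a one-row DataFrame.
f = pd.concat([f, pd.DataFrame([row])], ignore_index=True)
print(f)
```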

For the sake of a Pythonic approach, here is my answer:

res = pd.DataFrame(columns=('lib', 'qty1', 'qty2'))
res = res.append([{'qty1':10.0}], ignore_index=True)
print(res.head())

   lib  qty1  qty2
0  NaN  10.0   NaN

You can also build up a list of lists and convert it to a data frame:

import pandas as pd

columns = ['i','double','square']
rows = []

for i in range(6):
    row = [i, i*2, i*i]
    rows.append(row)

df = pd.DataFrame(rows, columns=columns)


This is not an answer to the OP's question but a toy example to illustrate @ShikharDua's answer above, which I found very useful.

While this fragment is trivial, in the actual data I had thousands of rows and many columns, and I wanted to group by different columns and then compute the stats below for more than one target column. So having a reliable method for building the data frame one row at a time was a great convenience. Thank you @ShikharDua!

import pandas as pd 

BaseData = pd.DataFrame({ 'Customer' : ['Acme','Mega','Acme','Acme','Mega','Acme'],
                          'Territory'  : ['West','East','South','West','East','South'],
                          'Product'  : ['Econ','Luxe','Econ','Std','Std','Econ']})
BaseData

columns = ['Customer','Num Unique Products', 'List Unique Products']

rows_list=[]
for name, group in BaseData.groupby('Customer'):
    RecordtoAdd={} #initialise an empty dict 
    RecordtoAdd.update({'Customer' : name}) #
    RecordtoAdd.update({'Num Unique Products' : len(pd.unique(group['Product']))})      
    RecordtoAdd.update({'List Unique Products' : pd.unique(group['Product'])})                   

    rows_list.append(RecordtoAdd)

AnalysedData = pd.DataFrame(rows_list)

print('Base Data : \n',BaseData,'\n\n Analysed Data : \n',AnalysedData)

Figured out a simple and nice way:

>>> df
     A  B  C
one  1  2  3
>>> df.loc["two"] = [4,5,6]
>>> df
     A  B  C
one  1  2  3
two  4  5  6

Create a new record (a DataFrame) and add it to old_data_frame.
Pass a list of values and the corresponding column names to create new_record:

new_record = pd.DataFrame([[0,'abcd',0,1,123]],columns=['a','b','c','d','e'])

old_data_frame = pd.concat([old_data_frame,new_record])

Here is the way to add/append a row in pandas DataFrame

def add_row(df, row):
    df.loc[-1] = row
    df.index = df.index + 1  
    return df.sort_index()

add_row(df, [1,2,3])

It can be used to insert/append a row in empty or populated pandas DataFrame

Another way to do it (probably not very performant):

# add a row
def add_row(df, row):
    colnames = list(df.columns)
    ncol = len(colnames)
    assert ncol == len(row), "Length of row must be the same as width of DataFrame: %s" % row
    return df.append(pd.DataFrame([row], columns=colnames))

You can also enhance the DataFrame class like this:

import pandas as pd
def add_row(self, row):
    self.loc[len(self.index)] = row
pd.DataFrame.add_row = add_row

Keep it simple: take a list as input and append it as a row of the data frame:

import pandas as pd  
res = pd.DataFrame(columns=('lib', 'qty1', 'qty2'))  
for i in range(5):  
    res_list = list(map(int, input().split()))  
    res = res.append(pd.Series(res_list,index=['lib','qty1','qty2']), ignore_index=True)

We often see the construct df.loc[subscript] = … used to assign to a single DataFrame row. Mikhail_Sam posted benchmarks that include, among others, this construct and the approach of using dict and creating the DataFrame at the end, and found the latter to be the fastest by far. But if we replace df3.loc[i] = … (with the preallocated DataFrame) in his code with df3.values[i] = …, the outcome changes significantly: that method then performs similarly to the one using dict. So df.values[subscript] = … deserves consideration more often. Note, however, that .values takes a zero-based positional subscript, which may differ from the labels in DataFrame.index.
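
A related sketch, not the code from this answer: instead of writing through df.values, it fills a preallocated NumPy array by position and wraps it in a DataFrame once at the end. It assumes all columns share one dtype (int64 here) and that the row count is known in advance; the names `buffer` and `n_rows` are illustrative.

```python
import numpy as np
import pandas as pd

n_rows = 1000
columns = ["A", "B", "C", "D", "E"]

# Preallocate a plain NumPy buffer instead of a DataFrame.
buffer = np.empty((n_rows, len(columns)), dtype=np.int64)

for i in range(n_rows):
    # Positional (zero-based) row assignment, as with df.values[i].
    buffer[i] = np.random.randint(100, size=len(columns))

# Wrap the filled buffer in a DataFrame once, at the end.
df = pd.DataFrame(buffer, columns=columns)
print(df.shape)
```

Writing into the array directly avoids depending on whether .values hands back a writable view of the DataFrame's data, which is not guaranteed for mixed-dtype frames.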

This will take care of adding an item to an empty DataFrame. The issue is that df.index.max() == nan for the first index:

import math

import pandas as pd

df = pd.DataFrame(columns=['timeMS', 'accelX', 'accelY', 'accelZ', 'gyroX', 'gyroY', 'gyroZ'])

df.loc[0 if math.isnan(df.index.max()) else df.index.max() + 1] = [x for x in range(7)]
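
The same idea can be wrapped in a small helper. This is a sketch under the assumption that the index holds integer labels; `append_row` is a hypothetical name, and it checks `df.empty` instead of testing the maximum label for NaN.

```python
import pandas as pd


def append_row(df, row):
    """Append `row` in place under the next integer label (hypothetical helper)."""
    # An empty DataFrame has no maximum label, so start at 0.
    next_label = 0 if df.empty else df.index.max() + 1
    df.loc[next_label] = row
    return df


df = pd.DataFrame(columns=["timeMS", "accelX", "accelY"])
append_row(df, [0, 1, 2])
append_row(df, [10, 11, 12])
print(df)
```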

Reference URL: https://stackoverflow.com/questions/10715965/add-one-row-to-pandas-dataframe
