Lazy method for reading a big file in Python?


shortcode 2022. 9. 18. 17:52


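I have a very big 4GB file, and when I try to read it my computer hangs. So I want to read it piece by piece, and after processing each piece, store the processed piece in another file and read the next piece.

Is there any method to yield these pieces?

I would love to have a lazy method.

To write a lazy function, just use yield: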

def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data


with open('really_big_file.dat') as f:
    for piece in read_in_chunks(f):
        process_data(piece)
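
The question also asks to save each processed piece to another file before reading the next one. A minimal sketch of that loop follows, assuming binary mode; `process_data` here is a hypothetical stand-in that just upper-cases the bytes, and `process_file` and its parameters are illustrative names, not part of the original answer.

```python
def read_in_chunks(file_object, chunk_size=1024):
    """Yield successive chunks until read() returns an empty value."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data


def process_data(piece):
    # Hypothetical stand-in for the real per-chunk processing.
    return piece.upper()


def process_file(src_path, dst_path, chunk_size=64 * 1024):
    """Stream src_path through process_data into dst_path, one chunk at a time."""
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        for piece in read_in_chunks(src, chunk_size):
            dst.write(process_data(piece))
```

Only one chunk is held in memory at a time, so the 4GB input never has to fit in RAM. This only works as-is when the processing does not depend on data that straddles a chunk boundary.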

Another option would be to use iter and a helper function:

f = open('really_big_file.dat')
def read1k():
    return f.read(1024)

for piece in iter(read1k, ''):
    process_data(piece)

If the file is line-based, the file object is already a lazy generator of lines:

for line in open('really_big_file.dat'):
    process_data(line)

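file.readlines() takes an optional size argument. It is a hint rather than a line count: reading stops once the total size of the lines read so far exceeds it, so the number of lines returned is only approximate.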

bigfile = open('bigfilename','r')
tmp_lines = bigfile.readlines(BUF_SIZE)
while tmp_lines:
    process([line for line in tmp_lines])
    tmp_lines = bigfile.readlines(BUF_SIZE)
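
The same readlines() loop can be wrapped in a generator so the caller just iterates over batches. This is a sketch only: `line_batches` and `size_hint` are names made up for illustration, and an in-memory `io.StringIO` stands in for the big file.

```python
import io


def line_batches(file_object, size_hint=64 * 1024):
    """Yield lists of complete lines, each list roughly size_hint in total size."""
    while True:
        lines = file_object.readlines(size_hint)
        if not lines:  # an empty list means end of file
            break
        yield lines


# Ten short lines, batched with a deliberately tiny hint.
sample = io.StringIO("".join(f"row {i}\n" for i in range(10)))
batches = list(line_batches(sample, size_hint=20))
```

Each batch contains only whole lines, so no line is ever split across two batches.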

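There are already many good answers, but if your entire file is on a single line and you still want to process "rows" (as opposed to fixed-size blocks), these answers will not help you.

99% of the time, it is possible to process files line by line. Then, as suggested in this answer, you can use the file object itself as a lazy generator: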

with open('big.csv') as f:
    for line in f:
        process(line)

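However, one may run into very big files where the row separator is not '\n' (a common case is '|').

Converting '|' to '\n' before processing may not be an option, because it can mess up fields that may legitimately contain '\n' (e.g. free-text user input).
Using the csv library is also ruled out because, at least in early versions of the lib, it is hardcoded to read the input line by line.

For these kinds of situations, I created the following snippet [Updated in May 2021 for Python 3.8+]: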

def rows(f, chunksize=1024, sep='|'):
    """
    Read a file where the row separator is '|' lazily.

    Usage:

    >>> with open('big.csv') as f:
    >>>     for r in rows(f):
    >>>         process(r)
    """
    row = ''
    while (chunk := f.read(chunksize)) != '':   # Loop until end of file
        while (i := chunk.find(sep)) != -1:     # Loop while a separator is found
            yield row + chunk[:i]
            chunk = chunk[i+1:]
            row = ''
        row += chunk
    yield row

[For older versions of Python]:

def rows(f, chunksize=1024, sep='|'):
    """
    Read a file where the row separator is '|' lazily.

    Usage:

    >>> with open('big.csv') as f:
    >>>     for r in rows(f):
    >>>         process(r)
    """
    curr_row = ''
    while True:
        chunk = f.read(chunksize)
        if chunk == '': # End of file
            yield curr_row
            break
        while True:
            i = chunk.find(sep)
            if i == -1:
                break
            yield curr_row + chunk[:i]
            curr_row = ''
            chunk = chunk[i+1:]
        curr_row += chunk
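
Both versions above advance past the separator with `chunk[i+1:]`, which assumes a one-character separator. As a hedged variation (not the author's code), the sketch below keeps the unfinished tail in a buffer and uses `str.split`, so a separator of any length is handled even when it straddles a chunk boundary. `split_rows` is a hypothetical name.

```python
import io


def split_rows(f, sep='|', chunksize=1024):
    """Lazily yield rows from f, where rows are delimited by sep."""
    buf = ''
    while True:
        chunk = f.read(chunksize)
        if not chunk:  # end of file
            break
        buf += chunk
        parts = buf.split(sep)
        # The last piece may be an incomplete row (or a partial separator),
        # so keep it in the buffer until more data arrives.
        buf = parts.pop()
        yield from parts
    yield buf  # whatever remains is the final row


result = list(split_rows(io.StringIO('a||b||c'), sep='||', chunksize=1))
print(result)  # ['a', 'b', 'c']
```

Like the original, it yields one empty row for an empty file. The trade-off is that the buffer is re-split on every read, so rows far longer than `chunksize` cost more than in the find-based version.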

I was able to use it successfully to solve various problems. It has been extensively tested with various chunk sizes. Here is the test suite I am using, for those who need to convince themselves:

import os

test_file = 'test_file'

def cleanup(func):
    def wrapper(*args, **kwargs):
        func(*args, **kwargs)
        os.unlink(test_file)
    return wrapper

@cleanup
def test_empty(chunksize=1024):
    with open(test_file, 'w') as f:
        f.write('')
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 1

@cleanup
def test_1_char_2_rows(chunksize=1024):
    with open(test_file, 'w') as f:
        f.write('|')
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 2

@cleanup
def test_1_char(chunksize=1024):
    with open(test_file, 'w') as f:
        f.write('a')
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 1

@cleanup
def test_1025_chars_1_row(chunksize=1024):
    with open(test_file, 'w') as f:
        for i in range(1025):
            f.write('a')
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 1

@cleanup
def test_1024_chars_2_rows(chunksize=1024):
    with open(test_file, 'w') as f:
        for i in range(1023):
            f.write('a')
        f.write('|')
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 2

@cleanup
def test_1025_chars_1026_rows(chunksize=1024):
    with open(test_file, 'w') as f:
        for i in range(1025):
            f.write('|')
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 1026

@cleanup
def test_2048_chars_2_rows(chunksize=1024):
    with open(test_file, 'w') as f:
        for i in range(1022):
            f.write('a')
        f.write('|')
        f.write('a')
        # -- end of 1st chunk --
        for i in range(1024):
            f.write('a')
        # -- end of 2nd chunk
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 2

@cleanup
def test_2049_chars_2_rows(chunksize=1024):
    with open(test_file, 'w') as f:
        for i in range(1022):
            f.write('a')
        f.write('|')
        f.write('a')
        # -- end of 1st chunk --
        for i in range(1024):
            f.write('a')
        # -- end of 2nd chunk
        f.write('a')
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 2

if __name__ == '__main__':
    for chunksize in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]:
        test_empty(chunksize)
        test_1_char_2_rows(chunksize)
        test_1_char(chunksize)
        test_1025_chars_1_row(chunksize)
        test_1024_chars_2_rows(chunksize)
        test_1025_chars_1026_rows(chunksize)
        test_2048_chars_2_rows(chunksize)
        test_2049_chars_2_rows(chunksize)

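If your computer, OS and Python are 64-bit, then you can use the mmap module to map the contents of the file into memory and access it with indices and slices. Here is an example from the documentation: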

import mmap

# Assumes hello.txt already contains b"Hello Python!\n"
with open("hello.txt", "r+b") as f:
    # memory-map the file, size 0 means whole file
    mm = mmap.mmap(f.fileno(), 0)
    # read content via standard file methods
    print(mm.readline())  # prints b"Hello Python!\n"
    # read content via slice notation
    print(mm[:5])  # prints b"Hello"
    # update content using slice notation;
    # note that new content must have same size
    mm[6:] = b" world!\n"
    # ... and read again using standard file methods
    mm.seek(0)
    print(mm.readline())  # prints b"Hello  world!\n"
    # close the map
    mm.close()

If either your computer, OS or Python is 32-bit, then mmap-ing large files can reserve large parts of your address space and starve your program of memory.
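
For read-only processing, a memory map can also be walked line by line without loading the file. A minimal sketch, assuming a non-empty file; `count_lines_mmap` is an illustrative name, not something from the mmap documentation.

```python
import mmap


def count_lines_mmap(path):
    """Count lines by iterating over a read-only memory map of the file."""
    with open(path, 'rb') as f:
        # Length 0 maps the whole file; mapping an empty file raises ValueError.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # readline() returns b'' at end of file, which stops iter().
            return sum(1 for _ in iter(mm.readline, b''))
```

The operating system pages the file in on demand, so memory use stays small even though the whole file is addressable.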

f = ... # file-like object, i.e. supporting read(size) function and 
        # returning empty string '' when there is nothing to read

def chunked(file, chunk_size):
    return iter(lambda: file.read(chunk_size), '')

for data in chunked(f, 65536):
    # process the data
    pass

UPDATE: The approach is best explained in https://stackoverflow.com/a/4566523/38592

In Python 3.8+ you can use .read() in a while loop:

with open("somefile.txt") as f:
    while chunk := f.read(8192):
        do_something(chunk)

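Of course, you can use any chunk size you want; you don't have to use 8192 (2**13). The size counts characters when the file is opened in text mode, as above, and bytes in binary mode. Unless your file's size happens to be a multiple of your chunk size, the last chunk will be smaller than your chunk size.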
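
To make the last-chunk behaviour concrete, here is a small sketch that records the size of every chunk. It uses an in-memory `io.BytesIO` of 20000 bytes in place of a real file, and `chunk_sizes` is a hypothetical helper name.

```python
import io


def chunk_sizes(file_object, chunk_size=8192):
    """Return the length of every chunk produced by repeated read() calls."""
    sizes = []
    while chunk := file_object.read(chunk_size):
        sizes.append(len(chunk))
    return sizes


sizes = chunk_sizes(io.BytesIO(b'x' * 20000))
print(sizes)  # [8192, 8192, 3616]
```

The loop ends because read() returns an empty bytes object at end of file, which is falsy.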

Refer to Python's official documentation: https://docs.python.org/3/library/functions.html#iter

Maybe this method is more Pythonic:

"""A file object returned by open() is an iterator with a
read method that can specify the block size of each read.
"""
from functools import partial

with open('mydata.db', 'r') as f_in:
    block_read = partial(f_in.read, 1024 * 1024)
    block_iterator = iter(block_read, '')

    for index, block in enumerate(block_iterator, start=1):
        block = process_block(block)  # process your block data

        with open(f'{index}.txt', 'w') as f_out:
            f_out.write(block)

I think we can write it like this:

def read_file(path, block_size=1024): 
    with open(path, 'rb') as f: 
        while True: 
            piece = f.read(block_size) 
            if piece: 
                yield piece 
            else: 
                return

for piece in read_file(path):
    process_piece(piece)

I can't comment because of my low reputation, but SilentGhost's solution should be much easier with file.readlines([sizehint])

Python file methods

edit: SilentGhost is right, but this should be better than:

s = ""
for i in range(100):
    s += next(file)

I am in a somewhat similar situation. It's not clear whether you know the chunk size in bytes; I usually don't, but the number of records (lines) required is known:

def get_line():
     with open('4gb_file') as file:
         for i in file:
             yield i

lines_required = 100
gen = get_line()
chunk = [i for i, j in zip(gen, range(lines_required))]

UPDATE: Thanks. Here's what I meant. It almost works, except that it loses a line 'between' chunks.

chunk = [next(gen) for i in range(lines_required)]

This does the trick without losing any lines, but it doesn't look very nice.
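
One possible alternative, not from the original answer, is `itertools.islice`, which takes up to N lines per call and does not consume an extra line the way zip() does. It also returns a short final chunk, whereas the next() list comprehension raises StopIteration when fewer than `lines_required` lines remain. `line_chunks` is a hypothetical name, and `io.StringIO` stands in for the 4GB file.

```python
import io
from itertools import islice


def line_chunks(file_object, lines_required=100):
    """Yield lists of up to lines_required lines without dropping any."""
    while True:
        chunk = list(islice(file_object, lines_required))
        if not chunk:  # nothing left to read
            break
        yield chunk


sample = io.StringIO("".join(f"line {i}\n" for i in range(7)))
lengths = [len(chunk) for chunk in line_chunks(sample, lines_required=3)]
print(lengths)  # [3, 3, 1]
```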

You can use the following code.

file_obj = open('big_file')

open() returns a file object.

Then use os.stat to get the size:

import os

file_size = os.stat('big_file').st_size

# Round up so the final partial chunk is not skipped
for i in range((file_size + 1023) // 1024):
    print(file_obj.read(1024))

Source URL: https://stackoverflow.com/questions/519633/lazy-method-for-reading-big-file-in-python
