
Pandas Dataframe writing using tons of RAM with large datasets #291

Closed
mattp0 opened this issue Jul 21, 2021 · 3 comments · Fixed by #293

mattp0 commented Jul 21, 2021

Steps to reproduce:
This is part of a Docker deployment of Grafana, InfluxDB, and Nginx.
Nginx is configured as a reverse proxy with the following config: a 10 minute request timeout and the max body size disabled.

location /influx/ {
    proxy_read_timeout 600;
    proxy_connect_timeout 600;
    proxy_send_timeout 600;
    client_max_body_size 0;
    proxy_pass http://IG-influxdb:8086/;
    proxy_pass_request_headers on;
    proxy_pass_request_body on;
    proxy_set_header Host       $host;
    proxy_set_header X-Real-IP  $remote_addr;
    rewrite ^/influx/(.*) /$1 break;
  }
1. Load a large dataset into Pandas (about 2.5GB in memory), 140,000 rows x 3000 cols.
2. Create a batch write API call with any parameters (tested with varying batch sizes, same result) and pass the DataFrame into the write, as in the sketch below.
3. Close the write_client.
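
A minimal sketch of what steps 1-3 look like in code (the URL, token, org, bucket, batch size, and the generated DataFrame below are placeholders, not the exact values from my deployment):

from influxdb_client import InfluxDBClient, WriteOptions
from influxdb_client.extras import pd, np

# Placeholder DataFrame roughly matching the real one:
# ~140,000 rows x 3,000 float columns (several GB once materialized).
data_frame = pd.DataFrame(np.random.rand(140_000, 3_000),
                          index=pd.date_range('2021-07-21', periods=140_000, freq='S'))

client = InfluxDBClient(url='https://192.168.1.200/influx', token='my-token', org='my-org')

# Batching WriteApi: the whole DataFrame is handed over in a single write() call.
write_api = client.write_api(write_options=WriteOptions(batch_size=5_000))
write_api.write(bucket='my-bucket', record=data_frame,
                data_frame_measurement_name='measurement_name')

write_api.close()
client.close()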

Expected behavior:

I have always used the influx 1.x client for my pandas-to-influx processing and this issue does not occur there. When making the change to 2.0 I expect a large data ingest like this to take some time, about 10-15 minutes, but to successfully upload to InfluxDB. I expect the batches to be sent out as they are completed.

Actual behavior:

The process eventually times out while consuming all 16GB of RAM and 4GB of swap on the system.
The retriable error occurred during request. Reason: 'HTTPSConnectionPool(host='192.168.1.200', port=443): Read timed out. (read timeout=3.5777276509998046)'

Specifications:

  • Client Version: 1.18
  • InfluxDB Version: 2.0.7
  • Platform: Ubuntu 20.04

Last thing to note: if I use a subset of my DataFrame, say 10,000 rows, the data writes with no issues.


bednar commented Jul 21, 2021

@CyberAngler93, thanks for using our client. We will take a look.


bednar commented Jul 22, 2021

@CyberAngler93, the current implementation of our batching API serializes the whole DataFrame into Line Protocol up front. These Line Protocol records are then used to create the batches, which is not an optimal way to do it - we will have to improve our WriteApi.

Meanwhile you can use this workaround - split the DataFrame into chunks and write the chunks with the synchronous API:

"""
How to ingest large DataFrame by splitting into chunks.
"""
import math
from datetime import datetime

from influxdb_client import InfluxDBClient
from influxdb_client.client.write.retry import WritesRetry
from influxdb_client.client.write_api import SYNCHRONOUS
from influxdb_client.extras import pd, np

"""
Configuration
"""
url = 'http://localhost:8086'
token = 'my-token'
org = 'my-org'
bucket = 'my-bucket'

dataframe_rows_count = 150_000
chunk_size = 100

"""
Generate Dataframe
"""
print()
print("=== Generating DataFrame ===")
print()

col_data = {
    'time': np.arange(0, dataframe_rows_count, 1, dtype=int),
    'tag': np.random.choice(['tag_a', 'tag_b', 'test_c'], size=(dataframe_rows_count,)),
}
for n in range(2, 2999):
    col_data[f'col{n}'] = np.random.rand(dataframe_rows_count)

data_frame = pd.DataFrame(data=col_data).set_index('time')
print(data_frame)

"""
Split DataFrame into Chunks
"""
print()
print("=== Splitting into chunks ===")
print()
chunks = []
number_of_chunks = int(math.ceil(len(data_frame) / float(chunk_size)))
for chunk_idx in range(number_of_chunks):
    chunks.append(data_frame[chunk_idx * chunk_size:(chunk_idx + 1) * chunk_size])

"""
Write chunks by Synchronous WriteAPI
"""
startTime = datetime.now()
with InfluxDBClient(url=url, token=token, org=org,
                    retries=WritesRetry(total=3, retry_interval=1, exponential_base=2)) as client:
    """
    Use synchronous version of WriteApi to strongly depends on result of write
    """
    write_api = client.write_api(write_options=SYNCHRONOUS)

    for idx, chunk in enumerate(chunks):
        print(f"Writing chunk {idx + 1}/{len(chunks)}...")
        write_api.write(bucket=bucket, record=chunk, data_frame_tag_columns=['tag'],
                        data_frame_measurement_name="measurement_name")

print()
print(f'Import finished in: {datetime.now() - startTime}')
print()
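
As a side note, the manual slicing loop above could also be written with numpy's array_split, which handles the final, shorter chunk automatically - just a variation on the same idea, not something the client requires:

chunks = np.array_split(data_frame, number_of_chunks)

The chunk_size of 100 is only a starting point; with roughly 3,000 columns every row serializes into a fairly long line, so you may want to tune it against your available memory and the Nginx/InfluxDB timeouts.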


mattp0 commented Jul 22, 2021

Will give this method a shot, thank you for the quick reply and code example!

@bednar bednar added the bug Something isn't working label Jul 26, 2021
@bednar bednar added this to the 1.20.0 milestone Jul 29, 2021