
Pandas Dataframe writing using tons of RAM with large datasets #291

Closed
mattp0 opened this issue Jul 21, 2021 · 3 comments · Fixed by #293

mattp0 commented Jul 21, 2021

Steps to reproduce:
This is part of a Docker deployment of Grafana, InfluxDB, and Nginx.
Nginx is configured as a reverse proxy with the following config: a 10 minute request timeout and the max body size disabled.

location /influx/ {
    proxy_read_timeout 600;
    proxy_connect_timeout 600;
    proxy_send_timeout 600;
    client_max_body_size 0;
    proxy_pass http://IG-influxdb:8086/;
    proxy_pass_request_headers on;
    proxy_pass_request_body on;
    proxy_set_header Host       $host;
    proxy_set_header X-Real-IP  $remote_addr;
    rewrite ^/influx/(.*) /$1 break;
  }
1. Load a large dataset into Pandas (about 2.5GB in memory), 140,000 rows x 3000 cols.
2. Create a batch write API call with any parameters (tested with varying batch sizes, same result) and pass the DataFrame into the write, as in the sketch below.
3. Close the write_client.
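
A minimal sketch of what steps 1-3 look like in code (the URL, token, org, bucket, batch size, and the generated DataFrame below are placeholders, not the exact values from my deployment):

from influxdb_client import InfluxDBClient, WriteOptions
from influxdb_client.extras import pd, np

# Placeholder DataFrame roughly matching the real one:
# ~140,000 rows x 3,000 float columns (several GB once materialized).
data_frame = pd.DataFrame(np.random.rand(140_000, 3_000),
                          index=pd.date_range('2021-07-21', periods=140_000, freq='S'))

client = InfluxDBClient(url='https://192.168.1.200/influx', token='my-token', org='my-org')

# Batching WriteApi: the whole DataFrame is handed over in a single write() call.
write_api = client.write_api(write_options=WriteOptions(batch_size=5_000))
write_api.write(bucket='my-bucket', record=data_frame,
                data_frame_measurement_name='measurement_name')

write_api.close()
client.close()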

Expected behavior:

I have always used the influx 1.x client for my pandas-to-influx processing and this issue does not occur there. When making the change to 2.0 I expect a large data ingest like this to take some time, about 10-15 minutes, but to successfully upload to InfluxDB. I expect the batches to be sent out as they are completed.

Actual behavior:

The process eventually times out while consuming all 16GB of RAM and 4GB of swap on the system.
The retriable error occurred during request. Reason: 'HTTPSConnectionPool(host='192.168.1.200', port=443): Read timed out. (read timeout=3.5777276509998046)'

Specifications:

  • Client Version: 1.18
  • InfluxDB Version: 2.0.7
  • Platform: Ubuntu 20.04

Last thing to note: if I use a subset of my DataFrame, say 10,000 rows, the data writes with no issues.


bednar commented Jul 21, 2021

@CyberAngler93, thanks for using our client. We will take a look.


bednar commented Jul 22, 2021

@CyberAngler93, the current implementation of our batching API serializes the whole DataFrame into Line Protocol up front. These Line Protocol records are then used to create the batches, which is not an optimal way to do it - we will have to improve our WriteApi.

Meanwhile you can use this workaround - split the DataFrame into chunks and write the chunks with the synchronous API:

"""
How to ingest large DataFrame by splitting into chunks.
"""
import math
from datetime import datetime

from influxdb_client import InfluxDBClient
from influxdb_client.client.write.retry import WritesRetry
from influxdb_client.client.write_api import SYNCHRONOUS
from influxdb_client.extras import pd, np

"""
Configuration
"""
url = 'http://localhost:8086'
token = 'my-token'
org = 'my-org'
bucket = 'my-bucket'

dataframe_rows_count = 150_000
chunk_size = 100

"""
Generate Dataframe
"""
print()
print("=== Generating DataFrame ===")
print()

col_data = {
    'time': np.arange(0, dataframe_rows_count, 1, dtype=int),
    'tag': np.random.choice(['tag_a', 'tag_b', 'test_c'], size=(dataframe_rows_count,)),
}
for n in range(2, 2999):
    col_data[f'col{n}'] = np.random.rand(dataframe_rows_count)

data_frame = pd.DataFrame(data=col_data).set_index('time')
print(data_frame)

"""
Split DataFrame into Chunks
"""
print()
print("=== Splitting into chunks ===")
print()
chunks = []
number_of_chunks = int(math.ceil(len(data_frame) / float(chunk_size)))
for chunk_idx in range(number_of_chunks):
    chunks.append(data_frame[chunk_idx * chunk_size:(chunk_idx + 1) * chunk_size])

"""
Write chunks by Synchronous WriteAPI
"""
startTime = datetime.now()
with InfluxDBClient(url=url, token=token, org=org,
                    retries=WritesRetry(total=3, retry_interval=1, exponential_base=2)) as client:
    """
    Use synchronous version of WriteApi to strongly depends on result of write
    """
    write_api = client.write_api(write_options=SYNCHRONOUS)

    for idx, chunk in enumerate(chunks):
        print(f"Writing chunk {idx + 1}/{len(chunks)}...")
        write_api.write(bucket=bucket, record=chunk, data_frame_tag_columns=['tag'],
                        data_frame_measurement_name="measurement_name")

print()
print(f'Import finished in: {datetime.now() - startTime}')
print()
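
As a side note, the manual slicing loop above could also be written with numpy's array_split, which handles the final, shorter chunk automatically - just a variation on the same idea, not something the client requires:

chunks = np.array_split(data_frame, number_of_chunks)

The chunk_size of 100 is only a starting point; with roughly 3,000 columns every row serializes into a fairly long line, so you may want to tune it against your available memory and the Nginx/InfluxDB timeouts.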


mattp0 commented Jul 22, 2021

Will give this method a shot, thank you for the quick reply and code example!

@bednar bednar added the bug Something isn't working label Jul 26, 2021
@bednar bednar added this to the 1.20.0 milestone Jul 29, 2021