Forum Discussion

prasath's avatar
prasath
Helper
10 months ago

API v2 Pagination Error

Hi,

I'm new to the Khoros community and I'm attempting to retrieve user data from February 2023 until the present (around 1.6 million records) using cursor-based pagination. The pagination generally works, but at some point I ran into two different failures, with the error messages below, while running the script at different times.

Note: I'm getting these errors after the 10th of 13 iterations.

 

Error: 403 Client Error: Forbidden for url: https://[Domain]/api/2.0/search?q=SELECT%20id,%20sso_id,%20email,%20banned,%20deleted,%20last_visit_time%20FROM%20users%20WHERE%20last_visit_time%20%3E%202023-02-01T00:00:00%20order%20by%20id%20ASC%20limit%2010000%20CURSOR%20'<cursor_value>'
Error: 504 Server Error: Gateway Time-out for url: https://[Domain]/api/2.0/search?q=SELECT%20id,%20sso_id,%20email,%20banned,%20deleted,%20last_visit_time%20FROM%20users%20WHERE%20last_visit_time%20%3E%202023-02-01T00:00:00%20order%20by%20id%20ASC%20limit%2010000%20CURSOR%20'<cursor_value>'

Additionally, I'm curious if there are any limitations on the number of API calls allowed per day. I'd appreciate any assistance in resolving this issue. Below is the script I'm using.

import requests
import pandas as pd

# "headers" is assumed to hold your authentication header(s),
# e.g. headers = {'li-api-session-key': '<session key>'}

first_url = "https://[COMMUNITY DOMAIN]/api/2.0/search?q=SELECT id, sso_id, email, banned, deleted, last_visit_time FROM users WHERE last_visit_time > 2023-02-01T00:00:00 order by id ASC limit 10000"


def fetch_data(url, headers):
    user_response = requests.get(url, headers=headers)
    user_response.raise_for_status()
    response = user_response.json()
    # print(response)
    print('Response size: ' + str(response['data']['size']))
    print('Response status: ' + str(response['status']))
    # print('Response message: ' + str(response['message']))

    return response['data']


final_df = pd.DataFrame()


def process_data(data):
    # Append the fetched items to the global dataframe
    global final_df  # Access the global final_df
    # print(data)
    df = pd.DataFrame(data)
    final_df = pd.concat([final_df, df], ignore_index=True)
    # print('df size', final_df.shape[0])


def paginate(url, headers):
    while True:
        data = fetch_data(url, headers)
        process_data(data['items'])

        # Check if there's a next cursor in the response
        if 'next_cursor' in data:
            # Set the next cursor for the next page request
            url = first_url + f" CURSOR '{data['next_cursor']}'"
            print('### Next URL ###', url)
            # time.sleep(10)
        else:
            # No more pages available, break the loop
            break


paginate(first_url, headers)

9 Replies

  • Hi prasath,

    Thank you for sharing your script and the issues you've encountered. It seems that the error messages you're receiving are related to the pagination limit in the URL. The limit you've set is currently at 10,000 records per page, which might be causing the server to return a 403 Forbidden error or a 504 Gateway Time-out, especially after several iterations.

    To address this, I would recommend adjusting the limit in the URL to 1,000 records per page, which aligns with the pagination best practices for the Khoros community API. This change should help alleviate the issues you're facing with the pagination process.

    Additionally, please keep in mind the rate limits imposed by the Khoros community API. Requests to the Author and Conversation APIs are limited to 60 calls in 60 seconds, while requests to the Analytics Reports API are rate-limited to 20 requests per minute. Ensuring that your script adheres to these rate limits will help prevent rate-limiting errors and ensure smooth API interactions.
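
    For example, a minimal throttling sketch (assuming the same requests-based script as above, with the limit reduced to 1,000 and a short sleep between pages to stay under 60 calls per 60 seconds):

    import time
    import requests

    BASE_QUERY = (
        "https://[COMMUNITY DOMAIN]/api/2.0/search?q="
        "SELECT id, sso_id, email, banned, deleted, last_visit_time "
        "FROM users WHERE last_visit_time > 2023-02-01T00:00:00 "
        "order by id ASC limit 1000"  # 1,000 records per page instead of 10,000
    )

    def fetch_pages(headers):
        url = BASE_QUERY
        while True:
            response = requests.get(url, headers=headers)
            response.raise_for_status()
            data = response.json()['data']
            yield data['items']
            if 'next_cursor' not in data:
                break  # no more pages
            url = BASE_QUERY + f" CURSOR '{data['next_cursor']}'"
            time.sleep(1)  # keeps the loop at or below ~60 calls per 60 seconds

    You can then feed each yielded page into your existing process_data() step.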

    For more information on rate limits and pagination with the Khoros community API, you can refer to the official documentation at the following URLs:

  • prasath's avatar
    prasath
    Helper
    10 months ago

     

    Hi vamsiu, thank you for your response. I'm currently facing an issue with extracting 1.6 million records efficiently. Fetching 1000 records per request with my script seems to be taking too much time. I've discovered that the API supports bulk data retrieval, and I've attempted to implement it using the Python example provided in the documentation.
    Document 1
    Document 2
    However, I'm having difficulty retrieving the specific fields I need, such as id, sso_id, and last_visit_time, from the response, and I'm also getting a huge volume of data for a single day.

    Regarding the second document, I'm unable to utilize it as I lack the necessary privileges to obtain the username and password required to instantiate the Khoros object.

    Also, how can I use the 'https://[COMMUNITY DOMAIN]/api/2.0/search?' API to get bulk data? Is it different from 'https://api.lithium.com/lsi-data/v2/data/export/community/ea.prod'?

    vamsiu , could you please assist me with this? Below is the script I'm currently using for bulk data retrieval.

    import requests
    import json

    access_token = 'access_token'  # Place the "Bulk API access token" from Community Analytics here
    client_id = 'client_id'        # Place the "Client ID" from Community Analytics here

    response = requests.get(
        'https://api.lithium.com/lsi-data/v2/data/export/community/ea.prod',
        params={'fromDate': '20240201',  # Set the start date in YYYYMMDD format
                'toDate': '20240202'},   # Set the end date in YYYYMMDD format
        auth=(access_token, ''),
        headers={'client-id': client_id,
                 'Accept': 'application/json'}  # Set for JSON results; leave unset to receive CSV
    )
    data = response.json()  # Parse the JSON response (requires the 'Accept: application/json' header above)

    with open('ahq_user_data.json', 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)


  • vamsiu's avatar
    vamsiu
    Ace
    10 months ago

    Hi prasath ,

    I see you're facing challenges with efficiently extracting 1.6 million records, and fetching 1000 records per request is taking too long. You've attempted to implement bulk data retrieval using the Python example provided in the documentation.

    1. Bulk data retrieval provides all the data within the specified parameters, which means it does not inherently support fetching specific fields. You would need to retrieve the entire dataset and then filter out the specific fields you need from the response data (a rough sketch of that filtering step is included at the end of this reply).

    2. If you're unable to instantiate the Khoros object due to lacking privileges, consider reaching out to Khoros support. They can assist you in obtaining the necessary access credentials or provide guidance on how to proceed. Requesting a Khoros support ticket can help address issues related to access permissions and authentication.   

    If you have the required permissions, then try the procedure below:

    Solution:

    1. Get Bulk Data API Connection Information:

      • Sign into the Community Analytics (formerly known as Lithium Social Intelligence or LSI) user interface.
      • Click on your username in the top-right corner to retrieve your connection information, including:
        • Community ID
        • Client ID
        • Access Token
    2. Supply Connection Information in Python SDK:

      • There are two ways to supply this information:
        • During Instantiation: Supply the connection information when creating the Khoros object. 

     

    from khoros import Khoros

    bulk_data_settings = {
        'community_id': 'example.prod',
        'client_id': 'ay0CXXXXXXXXXX/XXXX+XXXXXXXXXXXXX/XXXXX4KhQ=',
        'token': '2f25XXXXXXXXXXXXXXXXXXXXXXXXXa10dec04068',
    }
    # "settings" holds your general Community connection settings for the SDK
    khoros = Khoros(defined_settings=settings, bulk_data_settings=bulk_data_settings, auto_connect=False)

     

    3. Query the Bulk Data API:

    • Use the query() method with from_date, to_date, and export_type parameters to perform queries.
    • Example:

     

    results = khoros.bulk_data.query(from_date='20221025', to_date='20221101', export_type='json')

     

    4. Manipulate Retrieved Data:

    • After querying, manipulate the data using methods like:
      • Filtering by user type
      • Filtering by action
      • Counting actions, logins, and views    
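
    For example, to pull just the id, sso_id and last_visit_time style fields out of the bulk export saved earlier, a rough sketch (the key names are placeholders; inspect one record of your own export to find the actual keys):

    import json

    # Placeholder key names -- check one record of your export for the real ones.
    WANTED_KEYS = ('user.id', 'user.sso_id', 'event.time')

    with open('ahq_user_data.json', encoding='utf-8') as f:
        export = json.load(f)

    # 'records' is an assumption; adjust to the top-level key your export actually uses.
    records = export['records'] if isinstance(export, dict) else export

    filtered = [{key: record.get(key) for key in WANTED_KEYS} for record in records]
    print(len(filtered), filtered[:2])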

     

    Thank You! 

  • prasath's avatar
    prasath
    Helper
    10 months ago

    Hi vamsiu, I appreciate your assistance.

    Could you please provide guidance on how to achieve bulk data retrieval for the users collection? My understanding is that the two APIs I mentioned yield different results:

    Is there an API available to retrieve bulk data specifically from the users collection?
    Basically, I want to achieve bulk data retrieval for the request below:

    https://answers.ea.com:443/api/2.0/search?q=SELECT id, sso_id, email, banned, deleted FROM users WHERE last_visit_time > 2023-02-01T00:00:00 order by id ASC limit 1000

     

  • prasath not to my knowledge, no. The API (basically the /search endpoint) will be the way to go to extract user fields. Using a LIMIT 1000 is definitely recommended, as you will run into issues when bombarding the API trying to exceed that 1000 limit: a) requests are going to take very long even with that, and b) they'll just time out, either on the Khoros side or in the HTTP client you're using to send the requests. You can run them concurrently (e.g. in parallel), but then I'd highly recommend you implement some logic to deal with the quota-exceeded errors you'll eventually get back from the API if you overdo it, and retry those requests (see the rough sketch at the end of this reply).

    The only faster way I'm aware of is a direct SQL database dump that you can get from Khoros, but that comes at a pretty steep price (10k $ +) per single export and then you'll have the pleasure of deciphering the Khoros DB-structure to find all the relationships etc. that you might need in the haystack of tables you'll get from that dump. Therefore I'd recommend going the API route...
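
    A rough sketch of what that could look like (this assumes offset-based paging via LiQL's LIMIT/OFFSET rather than cursors, since cursor pages can only be fetched in order; the function names are made up for the example):

    import time
    from concurrent.futures import ThreadPoolExecutor

    import requests

    QUERY = ("https://[COMMUNITY DOMAIN]/api/2.0/search?q="
             "SELECT id, sso_id, email FROM users ORDER BY id ASC "
             "LIMIT 1000 OFFSET {offset}")

    def fetch_page(offset, headers, max_retries=5):
        # Back off and retry when the API answers with 429 (quota exceeded).
        for attempt in range(max_retries):
            resp = requests.get(QUERY.format(offset=offset), headers=headers)
            if resp.status_code == 429:
                time.sleep(2 ** attempt)
                continue
            resp.raise_for_status()
            return resp.json()['data']['items']
        raise RuntimeError(f"gave up on offset {offset} after {max_retries} retries")

    def fetch_concurrently(offsets, headers, workers=5):
        # e.g. offsets = range(0, 50000, 1000) for the first 50 pages
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(lambda o: fetch_page(o, headers), offsets))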

  • I believe it could be due to the offset (next_cursor). The larger the offset value, the more time the request can take, which could be why only the last three iterations are failing. I'm not entirely sure, but you could try reducing the limit to maybe 8000 or 7000 and try again.

  • DasariPrashanth's avatar
    DasariPrashanth
    Helper
    10 months ago

     

    I have tried fetching only user IDs and was able to get 80k of them using a for loop with 10 iterations.

  • prasath's avatar
    prasath
    Helper
    10 months ago

    DasariPrashanth luk 
    Appreciate your assistance. While I've been able to fetch records from the users collection with a limit of 1000 records per request, it's taking approximately 6 to 9 seconds for each request. Considering we have a massive total of 33 million records to extract, I'm seeking a more efficient approach to accomplish this task.

    Also, if I iterate through 1000 records at a time, is there a mechanism in place that allows me to resume the process from where it left off in case of failure?

  • luk's avatar
    luk
    Boss
    10 months ago

    I already outlined your options above, and yes, 6-9 secs for a collection of 1000 users is actually relatively "fast" 😉... mine can take much longer per 1000 items...

    As I mentioned above, I'm not aware of anything "faster", otherwise I would use it... you can get a (paid) DB dump which is much faster to extract data from once you figure out the table structure, but yes, it's a snapshot and you have to pay for each snapshot... so for me that was never really a viable option.

    I also mentioned that you can run concurrent requests, e.g. send out multiple requests in parallel, let's say 10 at a time that all would ask for 1000 items, that speeds things up a bit, but you'll run into 429 (too many requests) rate limits or worse, so you should implement a good retry mechanism for those requests.

    And no, of course there is no "mechanism in place" (where would that come from?) to resume anything if your requests fail; that is up to you to implement as a developer 😉... you could simply do that with some kind of index you maintain on your side (wherever you export the data to), but there are many other ways of course.
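
    For instance, a minimal sketch of such an index, just persisting the last successfully exported offset to a local file and resuming from there (fetch_page() is the offset-based helper sketched in my earlier reply, and export_items() stands in for however you store the data):

    import os

    CHECKPOINT_FILE = 'users_export.checkpoint'

    def load_offset():
        # Resume from the last offset that was exported successfully.
        if os.path.exists(CHECKPOINT_FILE):
            with open(CHECKPOINT_FILE) as f:
                return int(f.read().strip() or 0)
        return 0

    def save_offset(offset):
        with open(CHECKPOINT_FILE, 'w') as f:
            f.write(str(offset))

    offset = load_offset()
    while True:
        items = fetch_page(offset, headers)  # see the retry sketch above
        if not items:
            break
        export_items(items)                  # placeholder for however you persist the data
        offset += len(items)
        save_offset(offset)                  # only advance after a successful export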

    EDIT: I might add that with batch queries (which you seem already familiar with based on the screenshots above) you can technically do 50 requests "in one", meaning the batch query mechanism allows you to bundle a maximum of 50 requests into a single call. So you could create "batches" of 50 x 1000 users and send that off to the API endpoint, adjusting limit and offset for each bundled query like you did above; just be aware of that 50 max limit. Unfortunately this seems only theoretical: I have tried it, e.g. fetching 50k user IDs like that, and you will run into a "hard" timeout from the Khoros API server at 3 minutes, and in all instances I tried I never got a successful response on the first attempt... The Khoros API seems to do some caching in the background, so for example if you try to get 10k users at a time, the first run can easily take 90 seconds, but if you repeat that call a few times it can return in as little as 7 seconds... The problem, of course, is that usually we'd like things to a) succeed the first time and b) not require making the same request a second time...

    So probably a hand-optimized strategy of batch queries and concurrent requests might be the most efficient way to go about this, e.g. try to make it so the total response time for a batch query is not too long (I'd keep it under 60 seconds if possible) and then balance concurrent batch-query requests so you do not hit the API's 60 requests per 60 seconds limit... I personally do not go over 10k in one batch (and that's for IDs only...), and lower if I need more fields, as responses are going to get slower and slower the more fields you request. Another problem you might face is that responses get slower the higher your offset is, for some reason, so at first it can seem fine to do 25k at once, but as you move on you'll get a ton of 504 gateway timeout errors as your offset increases...