Forum Discussion

prasath's avatar
prasath
Helper
11 months ago

API v2 Pagination Error

Hi,

I'm new to the Khoros community and I'm attempting to retrieve users data from February 2023 until the present using pagination with cursor which has around 1.6 million records. The pagination method is generally effective, but at some point, I encountered two different failure with the error messages while running the script at different times.

Note: I'm getting this error after 10/13 iteration.

 

Error: 403 Client Error: Forbidden for url: https://[Domain]/api/2.0/search?q=SELECT%20id,%20sso_id,%20email,%20banned,%20deleted,%20last_visit_time%20FROM%20users%20WHERE%20last_visit_time%20%3E%202023-02-01T00:00:00%20order%20by%20id%20ASC%20limit%2010000%20CURSOR%20'<cursor_value>'
Error: 504 Server Error: Gateway Time-out for url: https://[Domain]/api/2.0/search?q=SELECT%20id,%20sso_id,%20email,%20banned,%20deleted,%20last_visit_time%20FROM%20users%20WHERE%20last_visit_time%20%3E%202023-02-01T00:00:00%20order%20by%20id%20ASC%20limit%2010000%20CURSOR%20'<cursor_value>'

 

 

 


Additionally, I'm curious if there are any limitations on the number of API calls allowed per day. I'd appreciate any assistance in resolving this issue. Below is the script I'm using.

 

 

 

first_url = "https://[COMMUNITY DOMAIN]/api/2.0/search?q=SELECT id, sso_id, email, banned, deleted, last_visit_time FROM users WHERE last_visit_time > 2023-02-01T00:00:00 order by id ASC limit 10000"
def fetch_data(url, headers):
    user_response = requests.get(url, headers=headers)
    user_response.raise_for_status()
    response = user_response.json()
    # print(response)
    print('Response message: ' + str(response['data']['size']))
    print('Response status: ' + str(response['status']))
    # print('Response message: ' + str(response['message']))

    return response['data']


final_df = pd.DataFrame()


def process_data(data):
    # Process the fetched data here
    global final_df# Access the global final_df
    # print(data)
    df = pd.DataFrame(data)
    final_df = pd.concat([final_df, df], ignore_index=True)
    # print('df size', final_df.shape[0])


def paginate(url, headers):
    while True:
        data = fetch_data(url, headers)
        process_data(data['items'])

        # Check if there's a next cursor in the response
        if 'next_cursor' in data:
            # Set the next cursor for the next page request
            url = first_url + f" CURSOR '{data['next_cursor']}'"
            print('### Next URL ###', url)
            # time.sleep(10)

        else:
            # No more pages available, break the loop
            break
paginate(first_url, headers)

 

 

 




9 Replies

  • Hi prasath,

    Thank you for sharing your script and the issues you've encountered. It seems that the error messages you're receiving are related to the pagination limit in the URL. The limit you've set is currently at 10,000 records per page, which might be causing the server to return a 403 Forbidden error or a 504 Gateway Time-out, especially after several iterations.

    To address this, I would recommend adjusting the limit in the URL to 1,000 records per page, which aligns with the pagination best practices for the Khoros community API. This change should help alleviate the issues you're facing with the pagination process.

    Additionally, please keep in mind the rate limits imposed by the Khoros community API. Requests to the Author and Conversation APIs are limited to 60 calls in 60 seconds, while requests to the Analytics Reports API are rate-limited to 20 requests per minute. Ensuring that your script adheres to these rate limits will help prevent rate-limiting errors and ensure smooth API interactions.

    For more information on rate limits and pagination with the Khoros community API, you can refer to the official documentation at the following URLs:

  • prasath's avatar
    prasath
    Helper
    11 months ago

     

    Hi vamsiu  Thank you for your response. I'm currently facing an issue with extracting 1.6 million records efficiently. Fetching 1000 records per request with my script seems to be taking too much time. I've discovered that the API supports bulk data retrieval, and I've attempted to implement it using the Python example provided in the documentation.
    Document1 
    Document 2 
    However, I'm encountering difficulties in retrieving the specific fields I need, such as id, sso_id, and last_visit_time from the response also getting huge volume of data for one day.

    Regarding the second document, I'm unable to utilize it as I lack the necessary privileges to obtain the username and password required to instantiate the Khoros object.

    Also how can i use 'https://[COMMUNITY DOMAIN]/api/2.0/search?' API to get bulk data,  is this different than 'https://api.lithium.com/lsi-data/v2/data/export/community/ea.prod' ?

    vamsiu , could you please assist me with this? Below is the script I'm currently using for bulk data retrieval.

     

     

    import requests
    import json
    access_token = 'access_token'  # Place the "Bulk API access token" from Community Analytics here
    client_id = 'client_id' # Place the "Client ID" from Community Analytics here
    response = requests.get(
        'https://api.lithium.com/lsi-data/v2/data/export/community/ea.prod',
        params={'fromDate': '20240201', # Set the start date in YYYYMMDD format
               'toDate': '20240202'},   # Set the end date in YYYYMMDD format
        auth=(access_token, ''),
        headers={'client-id': client_id, # TODO: place the "Client ID" from Community Analytics here
                 'Accept': 'application/json' }  # OPTIONAL: this could also be set to "application/json" to receive JSON results, or even left unset to receive CSV
    	)
    data = response.json() # Use "response.json()" incase you need json response (set 'Accept' header as mentioned above).
    
    
    with open('ahq_user_data.json', 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)

     

     

     

  • vamsiu's avatar
    vamsiu
    Ace
    11 months ago

    Hi prasath ,

    I see you're facing challenges with efficiently extracting 1.6 million records and fetching 1000 records per request is taking too long. You've attempted to implement bulk data retrieval using the Python example provided in the documentation. 

    1. Bulk data retrieval provides all the data within the specified parameters, which means it does not inherently support fetching specific fields. You would need to retrieve the entire dataset and then filter out the specific fields you need from the response data.

    2. If you're unable to instantiate the Khoros object due to lacking privileges, consider reaching out to Khoros support. They can assist you in obtaining the necessary access credentials or provide guidance on how to proceed. Requesting a Khoros support ticket can help address issues related to access permissions and authentication.   

    If You have required permissions   Then try Below procedure: 

    Solution:

    1. Get Bulk Data API Connection Information:

      • Sign into the Community Analytics (formerly known as Lithium Social Intelligence or LSI) user interface.
      • Click on your username in the top-right corner to retrieve your connection information, including:
        • Community ID
        • Client ID
        • Access Token
    2. Supply Connection Information in Python SDK:

      • There are two ways to supply this information:
        • During Instantiation: Supply the connection information when creating the Khoros object. 

     

    bulk_data_settings = {
        'community_id': 'example.prod',
        'client_id': 'ay0CXXXXXXXXXX/XXXX+XXXXXXXXXXXXX/XXXXX4KhQ=',
        'token': '2f25XXXXXXXXXXXXXXXXXXXXXXXXXa10dec04068',
    }
    khoros = Khoros(defined_settings=settings, bulk_data_settings=bulk_data_settings, auto_connect=False)​

     

       3. Query the Bulk Data API:

    • Use the query() method with from_date, to_date, and export_type parameters to perform queries.
    • Example:

     

    results = khoros.bulk_data.query(from_date='20221025', to_date='20221101', export_type='json')​

     

       4. Manipulate Retrieved Data:

    • After querying, manipulate the data using methods like:
      • Filtering by user type
      • Filtering by action
      • Counting actions, logins, and views    

     

    Thank You! 

  • prasath's avatar
    prasath
    Helper
    11 months ago

    Hi vamsiu , I appreciate your assistance.

    Could you please provide guidance on how to achieve bulk data retrieval for the users collection? My understanding is that the two APIs I mentioned yield different results:

    Is there an API available to retrieve bulk data specifically from the users collection?
    basically  I want to achieve the bulk data retrieval for the below request

    https://answers.ea.com:443/api/2.0/search?q=SELECT id, sso_id, email, banned, deleted FROM users WHERE last_visit_time > 2023-02-01T00:00:00 order by id ASC limit 1000

     

  • prasath not to my knowledge, no. The API (basically the /search endpoint) will be the way to go to extract user fields. Using a LIMIT 1000 is definitely recommended as you will run into issues when bombarding the API trying to exceed that 1000 limit, a) requests are gonna take very long with that already and b) they'll just time out either from khoros side or from your http client that you are using to send the requests. You can run them concurrently (e.g. parallel), but then I'd highly recommend you implement some logic to deal with quota exceeded errors you'll eventually get back from the API if you overdo it, which you should then retry.

    The only faster way I'm aware of is a direct SQL database dump that you can get from Khoros, but that comes at a pretty steep price (10k $ +) per single export and then you'll have the pleasure of deciphering the Khoros DB-structure to find all the relationships etc. that you might need in the haystack of tables you'll get from that dump. Therefore I'd recommend going the API route...

  • I believe it could be due to offset(next_cursor). The larger the offset value, the more time it could take for your request, it could be the reason why only last three ierations are failing. I am not entirely sure, but you could try reducing the limit to maybe 8000 or 7000 and try again.

  • DasariPrashanth's avatar
    DasariPrashanth
    Helper
    11 months ago

     

    I have tried getting only user ids and able to get 80k user ids using for loop with 10 iterations

  • prasath's avatar
    prasath
    Helper
    11 months ago

    DasariPrashanth luk 
    Appreciate your assistance. While I've been able to fetch records from the users collection with a limit of 1000 records per request, it's taking approximately 6 to 9 seconds for each request. Considering we have a massive total of 33 million records to extract, I'm seeking a more efficient approach to accomplish this task.

    Also If I iterate through 1000 records at a time, is there a mechanism in place that allows me to resume the process from where it left off in case of failure?

  • luk's avatar
    luk
    Boss
    11 months ago

    I already outlined your options above, and yes, 6-9 secs for a 1000 collection of users is actually relatively "fast" 😉... mine can take much longer per 1000 items...

    As I mentioned above, I'm not aware of any "faster", otherwise I would use it... you can get a (paid) DB dump which is much faster to extract data once you figure out the table structure, but yes, it's a snapshot and you have to pay for each snapshot...so for me that was never really a viable option.

    I also mentioned that you can run concurrent requests, e.g. send out multiple requests in parallel, let's say 10 at a time that all would ask for 1000 items, that speeds things up a bit, but you'll run into 429 (too many requests) rate limits or worse, so you should implement a good retry mechanism for those requests.

    And no, of course there is no "mechanism in place" (where do you have that?) to resume anything if your requests fail, that is up to you to implement as a developer 😉... you could simply do that with some kind of Index you maintain on your side (where you export the data), but there are many other ways of course.

    EDIT: I might add that with batch queries (which you seem already familiar with based on the screenshots above) you can technically do 50 requests "in one", meaning the batch query mechanism allows you to do a maximum of 50 requests "bundled" into one, so you could create "batches" of 50 x 1000 users and send that off to the API endpoint adjusting limit and offset for each bundled query like you did above, just be aware of that 50 max limit. Unfortunately this seems only theoretical, I have tried this, e.g. fetching 50k user ID's like this and you will run into a "hard" timeout from Khoros API server at 3 minutes, and for all instances I have tried I never got back a successful response with the first attempt...the Khoros API seems to do some caching in the background, so for example if you try and get 10k users at a time, the first run can take easily 90 seconds, but then if you repeat that call a few times, it can return as fast as in 7 seconds...problem is of course that usually we'd like things to a) succeed the first time and b) don't make the same request a second time...

    So probably a hand-optimized strategy of batch-queries and concurrent requests might be the most efficient way to go about this, e.g. try to make it so the total response time for a batch query is not too long (I'd keep it sub 60 seconds if possible) and then balance concurrent batch query requests so you do not hit the 60 requests per 60 seconds limit of the API...I personally do not go over 10k in one batch (that's for id's only...) and lower if I need more fields as responses are gonna get slower and slower the more fields you need. Another problem you might face is that responses get slower the higher your offset is for some reason, so at first it can seem fine to do 25k at once, but as you move on you'll get a ton of 504 gateway timeout errors as your offset increases...