Forum Discussion

prasath's avatar
prasath
Helper
7 months ago

API v2 Pagination Error

Hi,

I'm new to the Khoros community and I'm attempting to retrieve user data from February 2023 until the present (around 1.6 million records) using cursor-based pagination. The pagination generally works, but at some point I encountered two different failures, with the error messages below, while running the script at different times.

Note: I'm getting this error after the 10th of the 13 iterations.

Error: 403 Client Error: Forbidden for url: https://[Domain]/api/2.0/search?q=SELECT%20id,%20sso_id,%20email,%20banned,%20deleted,%20last_visit_time%20FROM%20users%20WHERE%20last_visit_time%20%3E%202023-02-01T00:00:00%20order%20by%20id%20ASC%20limit%2010000%20CURSOR%20'<cursor_value>'
Error: 504 Server Error: Gateway Time-out for url: https://[Domain]/api/2.0/search?q=SELECT%20id,%20sso_id,%20email,%20banned,%20deleted,%20last_visit_time%20FROM%20users%20WHERE%20last_visit_time%20%3E%202023-02-01T00:00:00%20order%20by%20id%20ASC%20limit%2010000%20CURSOR%20'<cursor_value>'

Additionally, I'm curious if there are any limitations on the number of API calls allowed per day. I'd appreciate any assistance in resolving this issue. Below is the script I'm using.

import requests
import pandas as pd

headers = {'li-api-session-key': '<session key>'}  # placeholder: the real auth headers are defined elsewhere in my script

first_url = "https://[COMMUNITY DOMAIN]/api/2.0/search?q=SELECT id, sso_id, email, banned, deleted, last_visit_time FROM users WHERE last_visit_time > 2023-02-01T00:00:00 order by id ASC limit 10000"


def fetch_data(url, headers):
    user_response = requests.get(url, headers=headers)
    user_response.raise_for_status()
    response = user_response.json()
    print('Response size: ' + str(response['data']['size']))
    print('Response status: ' + str(response['status']))

    return response['data']


final_df = pd.DataFrame()


def process_data(data):
    # Append the fetched page of items to the global DataFrame
    global final_df
    df = pd.DataFrame(data)
    final_df = pd.concat([final_df, df], ignore_index=True)


def paginate(url, headers):
    while True:
        data = fetch_data(url, headers)
        process_data(data['items'])

        # Check if there's a next cursor in the response
        if 'next_cursor' in data:
            # Set the next cursor for the next page request
            url = first_url + f" CURSOR '{data['next_cursor']}'"
            print('### Next URL ###', url)
            # time.sleep(10)
        else:
            # No more pages available, break the loop
            break


paginate(first_url, headers)

  • Hi prasath,

    Thank you for sharing your script and the issues you've encountered. It seems that the error messages you're receiving are related to the pagination limit in the URL. The limit you've set is currently at 10,000 records per page, which might be causing the server to return a 403 Forbidden error or a 504 Gateway Time-out, especially after several iterations.

    To address this, I would recommend adjusting the limit in the URL to 1,000 records per page, which aligns with the pagination best practices for the Khoros community API. This change should help alleviate the issues you're facing with the pagination process.

    Additionally, please keep in mind the rate limits imposed by the Khoros community API. Requests to the Author and Conversation APIs are limited to 60 calls in 60 seconds, while requests to the Analytics Reports API are rate-limited to 20 requests per minute. Ensuring that your script adheres to these rate limits will help prevent rate-limiting errors and ensure smooth API interactions.
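    For illustration, here is a minimal sketch of your loop with both suggestions applied: a limit of 1000 in the LiQL query and a short pause between calls (the auth header and the pause length are placeholders to adapt to your own setup):

    import time
    import requests

    headers = {'li-api-session-key': '<session key>'}  # placeholder auth header

    # limit reduced from 10000 to 1000
    first_url = ("https://[COMMUNITY DOMAIN]/api/2.0/search?"
                 "q=SELECT id, sso_id, email, banned, deleted, last_visit_time "
                 "FROM users WHERE last_visit_time > 2023-02-01T00:00:00 "
                 "order by id ASC limit 1000")

    def fetch_all(url, headers):
        items = []
        while True:
            resp = requests.get(url, headers=headers)
            resp.raise_for_status()
            data = resp.json()['data']
            items.extend(data['items'])
            if 'next_cursor' not in data:
                break
            url = first_url + f" CURSOR '{data['next_cursor']}'"
            time.sleep(1)  # short pause so the script stays well under the per-minute rate limits
        return items

    all_users = fetch_all(first_url, headers)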

    For more information on rate limits and pagination with the Khoros community API, you can refer to the official documentation.

    • prasath's avatar
      prasath
      Helper

      Hi vk-9, thank you for your response. I'm currently facing an issue with extracting 1.6 million records efficiently: fetching 1000 records per request with my script seems to be taking too much time. I've discovered that the API supports bulk data retrieval, and I've attempted to implement it using the Python example provided in the documentation.
      Document1 
      Document 2 
      However, I'm encountering difficulties retrieving the specific fields I need, such as id, sso_id, and last_visit_time, from the response, and I'm also getting a huge volume of data for a single day.

      Regarding the second document, I'm unable to utilize it as I lack the necessary privileges to obtain the username and password required to instantiate the Khoros object.

      Also, how can I use the 'https://[COMMUNITY DOMAIN]/api/2.0/search?' API to get bulk data? Is this different from 'https://api.lithium.com/lsi-data/v2/data/export/community/ea.prod'?

      vk-9, could you please assist me with this? Below is the script I'm currently using for bulk data retrieval.

      import requests
      import json

      access_token = 'access_token'  # Place the "Bulk API access token" from Community Analytics here
      client_id = 'client_id'        # Place the "Client ID" from Community Analytics here

      response = requests.get(
          'https://api.lithium.com/lsi-data/v2/data/export/community/ea.prod',
          params={'fromDate': '20240201',   # Start date in YYYYMMDD format
                  'toDate': '20240202'},    # End date in YYYYMMDD format
          auth=(access_token, ''),
          headers={'client-id': client_id,
                   'Accept': 'application/json'}  # Request JSON results; leave unset to receive CSV instead
      )
      data = response.json()  # Works because the 'Accept' header above requests JSON

      with open('ahq_user_data.json', 'w', encoding='utf-8') as f:
          json.dump(data, f, ensure_ascii=False, indent=4)

      • vk-9's avatar
        vk-9
        Ace

        Hi prasath ,

        I see you're facing challenges with efficiently extracting 1.6 million records, and fetching 1000 records per request is taking too long. You've attempted to implement bulk data retrieval using the Python example provided in the documentation.

        1. Bulk data retrieval provides all the data within the specified parameters, which means it does not inherently support fetching specific fields. You would need to retrieve the entire dataset and then filter out the specific fields you need from the response data (see the sketch after this list).

        2. If you're unable to instantiate the Khoros object due to lacking privileges, consider reaching out to Khoros support. They can assist you in obtaining the necessary access credentials or provide guidance on how to proceed. Requesting a Khoros support ticket can help address issues related to access permissions and authentication.   
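        For example, once you have the JSON export saved, keeping only a few fields could look like the rough sketch below (note that 'records', 'user.id', 'user.sso_id' and 'event.time.ms' are placeholder key names for illustration, so check the actual structure of your export first):

        import json

        # Placeholder key names -- adjust to whatever keys actually appear in your export
        WANTED_FIELDS = ('user.id', 'user.sso_id', 'event.time.ms')

        with open('ahq_user_data.json', encoding='utf-8') as f:
            export = json.load(f)

        filtered = [
            {field: record.get(field) for field in WANTED_FIELDS}
            for record in export.get('records', [])
        ]
        print(len(filtered), 'records kept,', len(WANTED_FIELDS), 'fields each')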

        If you have the required permissions, then try the procedure below:

        Solution:

        1. Get Bulk Data API Connection Information:

          • Sign into the Community Analytics (formerly known as Lithium Social Intelligence or LSI) user interface.
          • Click on your username in the top-right corner to retrieve your connection information, including:
            • Community ID
            • Client ID
            • Access Token
        2. Supply Connection Information in Python SDK:

          • There are two ways to supply this information:
            • During Instantiation: Supply the connection information when creating the Khoros object. 

        from khoros import Khoros  # Khoros Python SDK

        bulk_data_settings = {
            'community_id': 'example.prod',
            'client_id': 'ay0CXXXXXXXXXX/XXXX+XXXXXXXXXXXXX/XXXXX4KhQ=',
            'token': '2f25XXXXXXXXXXXXXXXXXXXXXXXXXa10dec04068',
        }
        # 'settings' holds your general Community connection settings
        khoros = Khoros(defined_settings=settings, bulk_data_settings=bulk_data_settings, auto_connect=False)

        3. Query the Bulk Data API:

        • Use the query() method with from_date, to_date, and export_type parameters to perform queries.
        • Example:

        results = khoros.bulk_data.query(from_date='20221025', to_date='20221101', export_type='json')

        4. Manipulate Retrieved Data:

        • After querying, manipulate the data using methods like:
          • Filtering by user type
          • Filtering by action
          • Counting actions, logins, and views    
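
        As a rough illustration of that kind of post-processing, counting records per action could look like this (again, 'records' and 'action.key' are placeholder key names; adjust them to the structure of your export):

        import json
        from collections import Counter

        with open('ahq_user_data.json', encoding='utf-8') as f:
            export = json.load(f)

        # Count how many records each action type has ('action.key' is a placeholder name)
        action_counts = Counter(
            record.get('action.key', 'unknown')
            for record in export.get('records', [])
        )
        print(action_counts.most_common(10))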

        Thank You! 

  • prasath not to my knowledge, no. The API (basically the /search endpoint) will be the way to go to extract user fields. Using LIMIT 1000 is definitely recommended, as you will run into issues when bombarding the API trying to exceed that 1000 limit: a) requests are going to take very long as it is, and b) they'll just time out, either on the Khoros side or in the HTTP client you are using to send the requests. You can run them concurrently (e.g. in parallel), but then I'd highly recommend you implement some logic to deal with the quota-exceeded errors you'll eventually get back from the API if you overdo it, and retry those requests (see the sketch below).

    The only faster way I'm aware of is a direct SQL database dump that you can get from Khoros, but that comes at a pretty steep price ($10k+) per single export, and then you'll have the pleasure of deciphering the Khoros DB structure to find all the relationships etc. that you might need in the haystack of tables you'll get from that dump. Therefore I'd recommend going the API route...
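
    A minimal sketch of that retry logic, assuming you key off the HTTP status code (the status list, back-off times and attempt count are assumptions to tune, not Khoros-documented values):

    import time
    import requests

    RETRY_STATUSES = (403, 429, 502, 503, 504)  # assumed transient / quota-exceeded responses

    def get_with_retry(url, headers, max_attempts=5):
        for attempt in range(1, max_attempts + 1):
            resp = requests.get(url, headers=headers, timeout=120)
            if resp.status_code in RETRY_STATUSES and attempt < max_attempts:
                time.sleep(2 ** attempt)  # exponential back-off before the next try
                continue
            resp.raise_for_status()       # on the final attempt, surface the error
            return resp.json()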

  • I believe it could be due to the offset (next_cursor). The larger the offset value, the more time your request could take, which could be the reason why only the last three iterations are failing. I am not entirely sure, but you could try reducing the limit to maybe 8000 or 7000 and try again.

    • DasariPrashanth's avatar
      DasariPrashanth
      Helper

      I have tried getting only user ids and was able to get 80k user ids using a for loop with 10 iterations.

      • prasath's avatar
        prasath
        Helper

        DasariPrashanth luk 
        Appreciate your assistance. While I've been able to fetch records from the users collection with a limit of 1000 records per request, it's taking approximately 6 to 9 seconds for each request. Considering we have a massive total of 33 million records to extract, I'm seeking a more efficient approach to accomplish this task.

        Also, if I iterate through 1000 records at a time, is there a mechanism in place that allows me to resume the process from where it left off in case of failure (e.g., something like the rough sketch below, where I'd save the last cursor to a file)?
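
        Roughly what I have in mind is this sketch (the checkpoint file name and auth header are placeholders, and I haven't verified whether a saved cursor stays valid after a failure):

        import json
        import os
        import requests

        BASE_QUERY = ("https://[COMMUNITY DOMAIN]/api/2.0/search?"
                      "q=SELECT id, sso_id, email, banned, deleted, last_visit_time "
                      "FROM users WHERE last_visit_time > 2023-02-01T00:00:00 "
                      "order by id ASC limit 1000")
        CHECKPOINT = 'last_cursor.json'                      # placeholder checkpoint file
        headers = {'li-api-session-key': '<session key>'}    # placeholder auth header

        def load_cursor():
            # Resume from the cursor saved by a previous (failed) run, if any
            if os.path.exists(CHECKPOINT):
                with open(CHECKPOINT) as f:
                    return json.load(f).get('next_cursor')
            return None

        def save_cursor(cursor):
            with open(CHECKPOINT, 'w') as f:
                json.dump({'next_cursor': cursor}, f)

        cursor = load_cursor()
        while True:
            url = BASE_QUERY + (f" CURSOR '{cursor}'" if cursor else "")
            resp = requests.get(url, headers=headers)
            resp.raise_for_status()
            data = resp.json()['data']
            # ... process data['items'] here ...
            cursor = data.get('next_cursor')
            if not cursor:
                break
            save_cursor(cursor)                              # checkpoint after every successful page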