Understanding Support Patterns #
Finding patterns in support cases helps uncover structural issues and bottlenecks in the data platform.
Just after BigQuery released the ML.GENERATE_TEXT function (mid 2024) and made Gemini accessible through BigQuery, I was curious to test this new feature. And what better use case than understanding support patterns and support issues on the fly?
So I tried the feature by analyzing a sample of our internal support tickets.
The basic process was quite simple and the final report turned out to be really insightful: it helped us eliminate a good 20 percent of support tickets, not by introducing AI in support, but by making small changes to our documentation and support process.
The high level steps are:
- selecting a meaningful but not too large number of tickets so as not to waste too many tokens
- summarizing the tickets, extracting the relevant dimensions (one LLM call per ticket)
- concatenating all the summaries
- prompting for a summary in a final large LLM call
Trying to do this in a single step is not advisable:
- the concatenated text becomes long and you can easily hit the token limit
- a lot of tokens are wasted while refining the final summarization prompt
- interpreting the issues at the ticket level already yields a structured dataset that can be used in further exploratory data analysis (a small extraction sketch follows the token-usage check below)
Select Sample Tickets #
SELECT
TO_JSON_STRING(t) AS text
FROM
table_with_all_tickets
WHERE
created_date >= "2024-01-01"
and RAND() < 0.04 -- select an appropriate ratio by calculating e.g. 1000/number_of_tickets
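If you do not know the ticket volume offhand, a quick helper query (same table and filter as above; purely illustrative) computes the sampling ratio for a target of roughly 1,000 tickets:
SELECT
1000 / COUNT(*) AS sampling_ratio -- use this value in the RAND() filter above
FROM
table_with_all_tickets
WHERE
created_date >= "2024-01-01"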
Summarize Tickets #
The summary prompt used was:
Summary Prompt
You are an IT support expert tasked with summarizing support requests for analysis to improve customer support. For each JSON formatted request, generate a concise summary in JSON format, focusing on themes and information relevant to improving support processes rather than individual request details. Pay particular attention to clarifying requests with generic customer_request_type values like "Other" or when it is null. Utilize all available fields to deduce the true nature of the requests.
Output the following for each request:
- description: A concise summary of the core issue.
- request_type: The customer_request_type value, clarified if generic or null.
- request_origin: Infer the request origin if not explicitly provided (e.g., from reporter, logged_via, etc.).
- logged_via: The logged_via value.
- assignment_group: The assignment_group value.
- client_information: Relevant client details, including customer_location, territory, or any deducible information (e.g., from project, components, etc.). If no information is available, explicitly state "Unknown".
- key_themes: Important recurring themes/keywords from description and summary. Crucially, identify the true nature of requests with generic customer_request_type values.
- resolution: The resolution value.
- status: The status value.
- sla_information: A dictionary containing:
  - breached: sla_time_to_resolution_breached (true/false).
  - paused: sla_time_to_resolution_paused (true/false).
  - elapsed_time: sla_time_to_resolution_elapsed_time.
  - remaining_time: sla_time_to_resolution_remaining_time.
  - outlier: (true/false) Flag if sla_time_to_resolution_elapsed_time seems unusually long or short relative to other similar requests (this requires some judgment based on the data distribution; use reasonable assumptions).
Example Output demonstrating clarification of request type and inferred origin:
{
"description": "OS patching activity notification for Prod servers.",
"request_type": "Server Maintenance Notification", // Clarified from "Application Question or Request"
"request_origin": "Internal", // Inferred from the content and the reporter being from the BMG AppSupport Team
"logged_via": "Mail",
"assignment_group": "Application Support L1",
"client_information": "Unknown", // Although customer_location is "Other", no specifics are given.
"key_themes": ["OS patching", "GCP", "Production servers", "maintenance"],
"resolution": "Done",
"status": "Resolved",
"sla_information": {
"breached": false,
"paused": false,
"elapsed_time": 579457,
"remaining_time": 114620543,
"outlier": false // Reasonable given the context.
  }
}
With this prompt we could then ask Gemini to do the work; running it on 1,000 tickets took about 40 minutes at the time of testing. The query below assumes the sampled tickets from the first step were stored in support_ticket_table.
WITH prompt_cte AS (
SELECT
"<INSERT PROMPT>" AS prompt),
prompt_with_data_cte AS (
SELECT
prompt || '\n\n' || text AS prompt
FROM
support_ticket_table
CROSS JOIN
prompt_cte
)
SELECT
*
FROM
ML.GENERATE_TEXT( MODEL `utils.gemini_1_5_pro_002`,
(
SELECT
prompt
FROM
prompt_with_data_cte
), STRUCT(500 AS max_output_tokens))
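The model referenced here is a BigQuery remote model pointing at a Vertex AI Gemini endpoint. If you have not set one up yet, the creation looks roughly like this; the connection name `us.gemini_conn` is a placeholder for your own Cloud resource connection with Vertex AI access:
CREATE OR REPLACE MODEL `utils.gemini_1_5_pro_002`
REMOTE WITH CONNECTION `us.gemini_conn` -- placeholder connection name
OPTIONS (ENDPOINT = 'gemini-1.5-pro-002');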
Concatenate Summaries #
With the results of the above SQL stored in a table (called analysis_prep here), I checked whether all summaries can be correctly concatenated into a single string (hoping it does not become too big):
SELECT
STRING_AGG(JSON_VALUE(ml_generate_text_result, '$.candidates[0].content.parts[0].text'), "\n") AS content,
LENGTH(STRING_AGG(JSON_VALUE(ml_generate_text_result, '$.candidates[0].content.parts[0].text'), "\n")) as length
FROM analysis_prep;
We can also easily check the token usage now:
SELECT
SUM(CAST(JSON_VALUE(ml_generate_text_result, '$.usage_metadata.total_token_count') as INT64)) as total_token_count
FROM analysis_prep;
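Because each per-ticket summary is itself JSON, the extracted fields can also be pulled into a structured table for the exploratory data analysis mentioned at the start. A minimal sketch, assuming the generated text comes back as clean JSON without markdown fences:
SELECT
JSON_VALUE(summary, '$.request_type') AS request_type,
JSON_VALUE(summary, '$.request_origin') AS request_origin,
JSON_VALUE(summary, '$.sla_information.breached') AS sla_breached
FROM (
SELECT
JSON_VALUE(ml_generate_text_result, '$.candidates[0].content.parts[0].text') AS summary
FROM
analysis_prep);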
Generate Report #
Final Report Prompt
You are a customer support expert tasked with analyzing preprocessed customer support request data to identify patterns and areas for improvement. The data has been preprocessed with each request containing the following fields: description, request_type (clarified), request_origin (inferred), logged_via, assignment_group, client_information, key_themes, resolution, status, and sla_information (containing breached, paused, elapsed_time, remaining_time, and outlier flags).
Use this preprocessed data to give insights into the following topics:
- High-level patterns: Identify high-level patterns in the preprocessed data. Consider the clarified request_type, inferred request_origin, client_information, and key_themes. For instance, are certain request_type values more frequent? Are specific client types (identified in client_information) submitting similar requests? Are there trends related to request_origin? Are certain key_themes frequently associated with specific request_types, client types, or origins?
- Frequency of request types: List the top most frequent request_type values. Analyze how these frequencies correlate with other fields like client_information and request_origin.
- SLA Performance: Analyze the sla_information fields.
  - Determine which request_type values have the longest elapsed_time, the highest breached rates, the most frequent paused instances, and the highest number of outlier flags.
  - Investigate correlations between SLA breaches/pauses and other fields like assignment_group, request_origin, and key_themes.
- Assignment Group Efficiency: Investigate the relationship between assignment_group and sla_information. Are certain groups more efficient (shorter elapsed_time, fewer breached instances) than others? Are there imbalances in workload distribution among the groups (frequency of assignments)?
- Key Theme Analysis: Examine the key_themes field for recurring themes or keywords. This field has already been preprocessed to capture the true nature of the request, especially for generic request_type values. Analyze the relationships between key_themes, request_type, and other fields. Identify any emerging trends or underlying issues not captured by the original customer_request_type.
- Channel Analysis: Consider the impact of request_origin and logged_via. Are certain channels more prone to specific key_themes, request_types, or SLA breaches/delays?
- Recommendations and further investigation: Based on your analysis, provide actionable recommendations for improving customer support. This could include:
  - Areas where automation might be beneficial based on recurring request_types or key_themes.
  - Potential training needs for specific assignment_groups based on performance metrics.
  - Process improvements to address common issues identified in key_themes.
  - Suggestions for improving data quality or adding more specific fields for better analysis.
  - Any other factors impacting customer support performance observed in the data.
Please do not provide code as output but only the patterns and direct answers that you see. This does not need to be a super accurate report; we just want to understand where to focus.
This is the SQL used for the final prompt:
WITH
prompt_cte AS (
SELECT
"<INSERT PROMPT>" AS prompt),
prompt_with_data_cte AS (
SELECT
prompt || '\n\n' || content AS prompt
FROM (
SELECT
STRING_AGG(JSON_VALUE(ml_generate_text_result, '$.candidates[0].content.parts[0].text'), ",") AS content
FROM analysis_prep)
CROSS JOIN
prompt_cte
)
SELECT
*,
JSON_VALUE(ml_generate_text_result, '$.candidates[0].content.parts[0].text') AS content,
CAST(JSON_VALUE(ml_generate_text_result, '$.usage_metadata.total_token_count') as INT64) as tokens
FROM
ML.GENERATE_TEXT( MODEL `utils.gemini_1_5_pro_002`,
(
SELECT
prompt
FROM
prompt_with_data_cte ), STRUCT(8000 AS max_output_tokens));
Unfortunately, I cannot share the results here, but I can only encourage you to try this yourself.