How to split large text into smaller chunks in Python

Text-to-speech (TTS) technology has come a long way in recent years, with many companies offering APIs that allow developers to easily convert written text into spoken words. One such company is Wellsaid Labs, which offers a TTS API that can be used to generate audio files from text.

However, there is a catch: the API can only process texts with a maximum length of 299 characters. This means that if you have a longer text, you will need to split it into smaller chunks before sending it to the API.

To solve this problem, I have written a Python function that takes a text and splits it into sub-paragraphs of at most 299 characters. The function first divides the text into paragraphs, which it expects to be marked with labels such as "Paragraph 1.", and then divides each paragraph into sentences. It then adds sentences to a sub-paragraph one at a time until the next sentence would push it past the maximum length, at which point it stores the sub-paragraph and starts a new one.
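The two splitting steps can be tried in isolation first. The snippet below is only a small sketch: the `sample` string is made up for illustration, while the two regular expressions are the same ones used in the full function that follows.

```python
import re

# Hypothetical sample text; the "Paragraph N." labels are what the splitter looks for.
sample = "Paragraph 1.First sentence. Second one! Paragraph 2.Is this the third? Yes."

# Step 1: split on the "Paragraph N." labels (case-insensitive).
# The first element is the empty string before the first label, so drop it.
paragraphs = re.split(r"(?i)paragraph \d+\.?", sample)[1:]

# Step 2: split each paragraph after ., ! or ? when whitespace follows.
sentences = [re.split(r"(?<=[.!?])\s+", p.strip()) for p in paragraphs]

print(paragraphs)
print(sentences)
```

Note that the sentence split only fires when whitespace follows the punctuation, so two sentences written without a space between them stay together as one.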

import re
import json

def split_text(text, max_chars):
    # Split the text into paragraphs
    paragraphs = re.split(r"(?i)paragraph \d+\.?", text)

    # Remove the first element (the text before the first label, normally an empty string)
    paragraphs = paragraphs[1:]

    # Initialize a dictionary to store the sub-paragraphs
    sub_paragraphs = {}

    # Initialize a counter for the paragraphs
    paragraph_counter = 1

    # Iterate over the paragraphs
    for paragraph in paragraphs:
        # Split the paragraph into sentences
        sentences = re.split(r"(?<=[.!?])\s+", paragraph)

        # Initialize a dictionary to store the current sub-paragraphs
        current_sub_paragraphs = {}

        # Initialize a variable to store the current sub-paragraph
        current_sub_paragraph = ""

        # Initialize a counter for the sub-paragraphs
        sub_paragraph_counter = 1

        # Iterate over the sentences
        for sentence in sentences:
            # If the sentence fits in the current sub-paragraph
            if len(current_sub_paragraph) + len(sentence) <= max_chars:
                # Add the sentence to the current sub-paragraph
                current_sub_paragraph += sentence
            else:
                # Store the current sub-paragraph, unless it is still empty
                # (which happens when the very first sentence is already too long)
                if current_sub_paragraph:
                    current_sub_paragraphs[f"Chunk_{sub_paragraph_counter}"] = current_sub_paragraph

                    # Increment the counter
                    sub_paragraph_counter += 1

                # Reset the current sub-paragraph and add the sentence
                current_sub_paragraph = sentence

        # Add the remaining sub-paragraph to the dictionary of sub-paragraphs
        current_sub_paragraphs[f"Chunk_{sub_paragraph_counter}"] = current_sub_paragraph

        # Add the dictionary of sub-paragraphs to the main dictionary
        sub_paragraphs[f"Par_{paragraph_counter}"] = current_sub_paragraphs

        # Increment the counter
        paragraph_counter += 1

    # Return the sub-paragraphs as a pretty-printed JSON string
    return json.dumps(sub_paragraphs, indent=2)
    

# Set the text to split
text_raw = """Paragraph 1.In this video, we are going to learn about the IT datacenter, various datacenter components, the concept of disaster recovery, and finally 
discuss different types of datacenters like on-premises datacenters, cloud datacenters, and edge datacenters. All right, so what is a data center? Well, simply, 
it's a room that supports IT equipment, which means the closet, a big facility, a big warehouse, or anything that can be described as a facility or room that 
supports data and telecommunications equipment and components that could be considered a datacenter.  
Paragraph 2.Datacenters are used to store, process, and manage large amounts of data, including data storage, analytics, and machine learning applications. 
Datacenters can be owned and operated by a business or organization, or they can be owned by a third party and rented out to multiple customers. Datacenters 
are often used to host websites and web applications, as well as other business-critical systems and applications. So what is in a datacenter? Some of the IT 
components that you will find in a datacenter include servers, storage devices, networking equipment, power systems, cooling systems, and physical securities. 
Let’s cover each topic briefly. 
Paragraph 3.Servers are the primary workhorses of a datacenter, and are used to run applications, store data, and perform various tasks.Storage devices are 
used to store data, and can include things like hard drives, storage arrays, and other types of storage systems. Networking equipment includes things like 
switches, routers, and other types of equipment that are used to connect devices and systems.
Paragraph 4.Datacenters require a reliable and stable power supply, so they are often equipped with generators, uninterruptible power supplies or UPS systems, 
and power distribution units. Since data centers make a lot of heat, they need cooling systems to keep the equipment at the right temperature. Physical security 
measures like security cameras, access control systems, and other measures protect the datacenter and its equipment from unauthorized access or tampering.
Paragraph 5.Now that you have an overview of a datacenter, let's talk about disaster recovery. In the event of a disaster or unexpected events, such as a natural 
disaster, power outage, or cyber-attack, a datacenter needs to have measures in place to 
ensure it can recover quickly and continue to function normally. This is where disaster recovery comes in. Disaster recovery typically includes things like 
backup systems, redundant hardware and systems, and offsite data storage facilities. If a disaster occurs and a company has a plan in place, it can continue 
operations at a disaster recovery site until it becomes safe to resume work at its usual location or a new permanent location.
Paragraph 6.Finally, let's discuss the different types of datacenters. There are three main types which are: on-premise datacenters, cloud datacenters, and 
edge datacenters. On-premise datacenters are owned and operated by a company and are located on the company's own premises. They are used to store and manage 
the company's own data and systems and are not shared with other organizations. Cloud datacenters are owned and operated by a third-party provider and are 
typically accessed through the internet. Companies can use cloud datacenters to store and manage their data and systems, rather than having to build and maintain 
their on-premise datacenters. There are many cloud datacenter providers in the market today, including Amazon Web Services or AWS, Microsoft Azure, Google Cloud, 
IBM Cloud etc. Edge datacenters are small datacenters that are located close to the edge of a network. They provide the same devices found in traditional data 
centers, but have a smaller footprint, closer to end users and devices. They are often used in industries like manufacturing, retail, and transportation to reduce 
latency and improve the speed and performance of applications and services.
Paragraph 7.Red Hat offers technologies that can support any use case, whether it is for on-premises datacenters, cloud datacenters, or edge data centers. We 
hope this overview has given you a better understanding of IT datacenters. We will cover more IT topics in the upcoming videos. Thank you for watching!
"""

# Remove the newline characters from the text
text = text_raw.replace("\n", "")

# Set the maximum number of characters per sub-paragraph
max_chars = 299

# Split the text into sub-paragraphs
sub_paragraphs = split_text(text, max_chars)

# Print the sub-paragraphs
print(sub_paragraphs)

# Check the length of each chunk in the JSON object
json_data = json.loads(sub_paragraphs)

for key, value in json_data.items():
    if isinstance(value, dict):
        for inner_key, inner_value in value.items():
            print(f"{key}/{inner_key}: {len(inner_value)}")
    else:
        print(f"{key}: {len(value)}")

This function can be very useful when working with the Wellsaid Labs TTS API, as it allows you to easily split longer texts into smaller chunks that can be processed by the API. However, it can also be used in other situations where you need to split a text into smaller pieces, such as when sending messages through a messaging app or displaying text on a webpage.
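For those other situations, a flat list of chunks is often easier to work with than nested JSON. The variant below is a rough sketch rather than a drop-in replacement: `split_into_chunks` is a hypothetical name, it ignores the "Paragraph N." labels, it joins sentences with a space (the function above concatenates them directly), and it hard-wraps any single sentence that is longer than the limit, which the function above would pass through unchanged.

```python
import re

def split_into_chunks(text, max_chars):
    """Return a flat list of chunks, each at most max_chars long (hypothetical helper)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = []
    current = ""
    for sentence in sentences:
        # Hard-wrap a sentence that cannot fit in any chunk on its own.
        # This is crude: it may cut in the middle of a word.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue
        # Include the joining space when testing whether the sentence fits.
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks

print(split_into_chunks("One. Two three. Four five six.", 16))
```

Cutting an overlong sentence at a fixed position is the simplest fallback; splitting at the last space or comma before the limit would likely sound better in generated audio.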

To use the function, simply pass in the text you want to split and the maximum number of characters per sub-paragraph. The function returns a JSON-formatted string with the following structure:

{
  "Par_1": {
    "Chunk_1": "In this video, we are going to learn about the IT datacenter, various datacenter components, the concept of disaster recovery, and finally discuss different types of datacenters like on-premises datacenters, cloud datacenters, and edge datacenters.All right, so what is a data center?",
    "Chunk_2": "Well, simply, it's a room that supports IT equipment, which means the closet, a big facility, a big warehouse, or anything that can be described as a facility or room that supports data and telecommunications equipment and components that could be considered a datacenter."
  },
  "Par_2": {
    "Chunk_1": "Datacenters are used to store, process, and manage large amounts of data, including data storage, analytics, and machine learning applications.Datacenters can be owned and operated by a business or organization, or they can be owned by a third party and rented out to multiple customers.",
    "Chunk_2": "Datacenters are often used to host websites and web applications, as well as other business-critical systems and applications.So what is in a datacenter?",
    "Chunk_3": "Some of the IT components that you will find in a datacenter include servers, storage devices, networking equipment, power systems, cooling systems, and physical securities.Let’s cover each topic briefly."
  },
  "Par_3": {
    "Chunk_1": "Servers are the primary workhorses of a datacenter, and are used to run applications, store data, and perform various tasks.Storage devices are used to store data, and can include things like hard drives, storage arrays, and other types of storage systems.",
    "Chunk_2": "Networking equipment includes things like switches, routers, and other types of equipment that are used to connect devices and systems."
  },
  "Par_4": {
    "Chunk_1": "Datacenters require a reliable and stable power supply, so they are often equipped with generators, uninterruptible power supplies or UPS systems, and power distribution units.Since data centers make a lot of heat, they need cooling systems to keep the equipment at the right temperature.",
    "Chunk_2": "Physical security measures like security cameras, access control systems, and other measures protect the datacenter and its equipment from unauthorized access or tampering."
  },
  "Par_5": {
    "Chunk_1": "Now that you have an overview of a datacenter, let's talk about disaster recovery.",
    "Chunk_2": "In the event of a disaster or unexpected events, such as a natural disaster, power outage, or cyber-attack, a datacenter needs to have measures in place to ensure it can recover quickly and continue to function normally.This is where disaster recovery comes in.",
    "Chunk_3": "Disaster recovery typically includes things like backup systems, redundant hardware and systems, and offsite data storage facilities.",
    "Chunk_4": "If a disaster occurs and a company has a plan in place, it can continue operations at a disaster recovery site until it becomes safe to resume work at its usual location or a new permanent location."
  },
  "Par_6": {
    "Chunk_1": "Finally, let's discuss the different types of datacenters.There are three main types which are: on-premise datacenters, cloud datacenters, and edge datacenters.On-premise datacenters are owned and operated by a company and are located on the company's own premises.",
    "Chunk_2": "They are used to store and manage the company's own data and systems and are not shared with other organizations.Cloud datacenters are owned and operated by a third-party provider and are typically accessed through the internet.",
    "Chunk_3": "Companies can use cloud datacenters to store and manage their data and systems, rather than having to build and maintain their on-premise datacenters.There are many cloud datacenter providers in the market today, including Amazon Web Services or AWS, Microsoft Azure, Google Cloud, IBM Cloud etc.",
    "Chunk_4": "Edge datacenters are small datacenters that are located close to the edge of a network.They provide the same devices found in traditional data centers, but have a smaller footprint, closer to end users and devices.",
    "Chunk_5": "They are often used in industries like manufacturing, retail, and transportation to reduce latency and improve the speed and performance of applications and services."
  },
  "Par_7": {
    "Chunk_1": "Red Hat offers technologies that can support any use case, whether it is for on-premises datacenters, cloud datacenters, or edge data centers.We hope this overview has given you a better understanding of IT datacenters.We will cover more IT topics in the upcoming videos.Thank you for watching!"
  }
}
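To send the chunks to an API one at a time, the nested structure has to be walked in order. The snippet below is a minimal sketch of that step: `iter_chunks` is a hypothetical helper, the small `sub_paragraphs` string stands in for the real return value of `split_text`, and the actual API call is left as a comment because it depends on the client you use.

```python
import json

# A small stand-in for the string returned by split_text().
sub_paragraphs = json.dumps({
    "Par_1": {"Chunk_1": "First chunk.", "Chunk_2": "Second chunk."},
    "Par_2": {"Chunk_1": "Third chunk."},
})

def iter_chunks(json_string):
    """Yield (label, text) pairs in document order (hypothetical helper)."""
    data = json.loads(json_string)
    for par_key, chunks in data.items():
        for chunk_key, chunk_text in chunks.items():
            yield f"{par_key}/{chunk_key}", chunk_text

for label, chunk_text in iter_chunks(sub_paragraphs):
    # A real script would pass chunk_text to the TTS client here.
    print(label, "->", chunk_text)
```

Because `json.loads` builds ordinary dictionaries, which keep insertion order, the chunks come back in the same order in which they were stored.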

To check the length of each value within the nested JSON object, you can use the following code:

json_data = json.loads(sub_paragraphs)

for key, value in json_data.items():
    if isinstance(value, dict):
        for inner_key, inner_value in value.items():
            print(f"{key}/{inner_key}: {len(inner_value)}")
    else:
        print(f"{key}: {len(value)}")

With the example text above, the output should look something like this:

Par_1/Chunk_1: 284
Par_1/Chunk_2: 272
Par_2/Chunk_1: 287
Par_2/Chunk_2: 153
Par_2/Chunk_3: 204
Par_3/Chunk_1: 256
Par_3/Chunk_2: 135
Par_4/Chunk_1: 288
Par_4/Chunk_2: 172
Par_5/Chunk_1: 82
Par_5/Chunk_2: 261
Par_5/Chunk_3: 133
Par_5/Chunk_4: 198
Par_6/Chunk_1: 265
Par_6/Chunk_2: 228
Par_6/Chunk_3: 296
Par_6/Chunk_4: 214
Par_6/Chunk_5: 166
Par_7/Chunk_1: 294
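Reading the lengths by eye works for a short text, but for longer inputs it may be easier to list only the chunks that break the limit. The following is a small sketch of such a check: `oversized_chunks` is a hypothetical helper, and the `sample` string is made up so that the snippet runs on its own.

```python
import json

def oversized_chunks(json_string, max_chars):
    """Return (label, length) for every chunk longer than max_chars (hypothetical helper)."""
    data = json.loads(json_string)
    return [
        (f"{par}/{chunk}", len(text))
        for par, chunks in data.items()
        for chunk, text in chunks.items()
        if len(text) > max_chars
    ]

# Made-up sample with one chunk that is a single character over the limit.
sample = json.dumps({"Par_1": {"Chunk_1": "short", "Chunk_2": "x" * 300}})
print(oversized_chunks(sample, 299))
```

An empty list means every chunk fits. A non-empty list usually points to a single sentence that is longer than the limit on its own.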
