The primary objective of my CS291K final project was to develop an application that demonstrates a creative new use of neural networks. My experience as a graduate student at UCSB (e.g., lacking the time to read four to five research papers for each course every week) made me realize the need for an accurate long-text summarization app. With numerous potential use cases in mind, I decided to design Precis, a long-text summarization app powered by state-of-the-art models, which summarizes long texts based on the user's choice of word count and text domain.
The main inspiration came from my experience over the last couple of years as a computer science undergraduate and graduate researcher, i.e., reading a lot of research papers on a regular basis. Popular online text summarization apps such as Resoomer and QuillBot summarize short texts (~150-200 words) for free; anything longer is a paid service. They also lack domain-specific summarization, so they do not summarize scientific papers and news equally accurately. Finally, they fail to identify subsections, which causes information loss when summarizing medical documents and makes the summaries unreliable.
The potential use cases for Precis are as follows:
- Summarize long medical documents for doctors, reducing the time required to understand a patient’s medical history
- Enhance the buyer's experience by providing an instant summary of the various reviews for a product
- Generate a background summary of job applicants/insurance applicants instantly, without losing valuable information, for a faster decision process
- Summarize multiple emails for busy professionals to identify which ones are worth replying to or reviewing in detail.
As my use case was focussed on domain-specific long text summarization, my preliminary task was to decide on the specific functionalities that I wanted to include in the app. Based on the functionalities which I narrowed down for the minimum viable product, I decided on the following features for the app:
- Users can provide text to Precis as a paragraph, as a document upload or as a link to a website.
- Precis should allow both extractive summarization (extracting the important sentences from the source document that best depict the meaning of the whole document in a more concise manner) and abstractive summarization (generating sentences that are not in the document but are coherent and convey the context of the document).
- Users can provide text from any domain for summarization and should be able to select from some of the important domains, such as news, medical and scientific text, to obtain a more accurate summary.
The next step was to explore the various domain-specific datasets and the document summarization technologies and frameworks that could enable the above functionalities in Precis within the limited time frame of the project and the course.
I settled on the following datasets:
News - cnn_dailymail
- English-language dataset containing just over 300k unique news articles as written by journalists at CNN and the Daily Mail.
- Supports both extractive and abstractive summarization
Scientific - arxiv_dataset
- A dataset of 1.7 million arXiv articles
- Because the full dataset is rather large (1.1TB and growing), this dataset provides only a metadata file in JSON format.
Medical - pubmed
- NLM produces a baseline set of MEDLINE/PubMed citation records in XML format for download on an annual basis.
- The annual baseline is released in December of each year.
I explored the following frameworks and deep learning models:
Although the original Transformer model (Vaswani et al.) has gained a lot of recognition in text summarization, it does not perform at its best on long-text summarization tasks. The model scales poorly with the length of the input sequence because of its self-attention layers: every token in the input sequence computes a score with every other token by taking the dot product of its query vector with the other token's key vector. Producing all of these similarity scores in each self-attention layer is computationally intensive (i.e. quadratic time and space complexity in the sequence length), which results in usability and scalability issues for long input sequences.
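To make the quadratic cost concrete, here is a minimal pure-Python sketch of dense scaled dot-product attention. It is only an illustration, not the code used in Precis; the function name `full_self_attention` and the `scores_computed` counter are assumptions introduced for this example.

```python
import math

def full_self_attention(queries, keys, values):
    """Dense scaled dot-product attention over lists of vectors.

    Returns the attended outputs and the number of query-key scores computed.
    """
    d = len(queries[0])
    outputs = []
    scores_computed = 0
    for q in queries:
        # One dot product per (query, key) pair: n scores per row, n * n overall.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        scores_computed += len(scores)
        # Numerically stable softmax over this row of scores.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Output is the attention-weighted sum of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return outputs, scores_computed
```

Doubling the number of tokens quadruples `scores_computed`, which is the scaling problem that sparse attention patterns try to avoid.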
One of the prevalent solutions for long texts is to let every token still gather information from the whole sequence while avoiding the quadratic cost. This is achieved by computing scores for only a limited set of token pairs; such restricted patterns are called sparse attention patterns.
Variants of transformers
In this exploration stage, I performed an in-depth analysis of the architectures of state-of-the-art sparse attention models with linear and logarithmic complexity (listed in the figure below), such as Reformer, Performer, Informer, Terraformer, Longformer and BigBird. Based on the model architectures and the results reported by the authors in the respective research papers, I decided to move forward with Longformer and BigBird because of the following advantages:
- Speed: Faster inference holds very high importance in case of online tools
- Scalability: Lighter models are easier to scale across multiple cloud services which results in faster deployment
The first step in this project was to understand how sparse attention mechanisms work and to decide on the parameters for finetuning the state-of-the-art models on the datasets of choice. After finalizing the models, the task was to build the frontend and the backend, connect them together and finally deploy the app to a cloud service for global usage.
My research for the course project was primarily focused on the following two models:
The authors of Longformer proposed a sparsified form of the self-attention mechanism in which they specify which pairs of input positions attend to one another. This lets the model scale linearly with the length of the input sequence and hence makes it efficient for longer sequences. The key components of the attention pattern are:
Sliding Window Attention
- It proposes a fixed-size attention window surrounding each token.
- In the first layer, each token only attends to w/2 tokens on each side. However, when such layers are stacked on top of one another, the topmost layer is able to draw on all the tokens in the input text.
- Thus, although a single layer loses long-range information, attention from all tokens is gained through the stacked layers, which results in a large receptive field.
- As a result, the memory complexity per layer is O(n × w), i.e. linear in the sequence length n for a fixed window size w, which is an advantage for long inputs.
Dilated Sliding Window Attention
- To further increase the receptive field, a dilated sliding window is proposed.
- In this case, the window has gaps of size d (the dilation), so each token attends to neighbours that are spaced d positions apart and covers a wider span of the sequence with the same number of attended tokens.
- Although a dilated sliding window on its own might still need many layers to cover the entire sequence, the approach is beneficial when plain sliding windows are used in the lower layers to capture local information and dilated sliding windows in the higher layers to capture global information.
Global and Sliding Window Attention
- Here, a few specific tokens are chosen that attend to all tokens across the sequence, and all tokens attend to them.
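The three Longformer components above can be sketched as a function that lists, for each position, the positions it is allowed to attend to. This is a simplified token-level illustration under my own assumptions (the name `longformer_pattern`, its parameters and the choice to space dilated neighbours exactly `dilation` positions apart are hypothetical), not the actual Longformer implementation.

```python
def longformer_pattern(n, window, dilation=1, global_tokens=()):
    """For each position i in a sequence of length n, return the sorted
    list of positions that i may attend to."""
    half = window // 2
    global_set = set(global_tokens)
    pattern = []
    for i in range(n):
        allowed = set()
        # (Dilated) sliding window: w/2 neighbours on each side,
        # spaced `dilation` positions apart (dilation=1 is the plain window).
        for k in range(-half, half + 1):
            j = i + k * dilation
            if 0 <= j < n:
                allowed.add(j)
        # Global attention is symmetric: a global token attends to everything,
        # and every token attends to the global tokens.
        if i in global_set:
            allowed.update(range(n))
        allowed.update(global_set)
        pattern.append(sorted(allowed))
    return pattern


if __name__ == "__main__":
    for row in longformer_pattern(8, window=2, global_tokens=(0,)):
        print(row)
```

With a fixed window and a fixed number of global tokens, the total number of allowed pairs grows linearly with `n`, in contrast to the n × n pairs of full self-attention.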
BigBird was proposed a few months after Longformer and, according to its authors, performs slightly better than Longformer because of one additional component. Apart from sliding window attention and global attention, which are similar to those of Longformer, it adds a random attention pattern.
- In this case, each node (i.e. token) randomly attends to a selected number of tokens in the layer. The motivation comes from graph theory: in a sparse random graph, the path between any two nodes is short (roughly logarithmic in the number of nodes), so information can travel between any two tokens in few hops. The transfer of attention from one node to another happens once per layer.
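As a rough sketch of how random attention combines with the other two patterns, the function below builds a token-level attention pattern with a sliding window, global tokens and a few randomly sampled extra keys per token. The real BigBird works on blocks of tokens rather than single tokens; the name `bigbird_pattern`, its parameters and the fixed `seed` are assumptions made for this illustration.

```python
import random

def bigbird_pattern(n, window, num_random, global_tokens=(), seed=0):
    """Token-level sketch of a BigBird-style pattern: window + global + random."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    half = window // 2
    global_set = set(global_tokens)
    pattern = []
    for i in range(n):
        # Sliding window around position i.
        allowed = {j for j in range(i - half, i + half + 1) if 0 <= j < n}
        # Global tokens: seen by everyone, and they see everyone.
        allowed.update(global_set)
        if i in global_set:
            allowed.update(range(n))
        # Random attention: sample extra keys that i does not already attend to.
        candidates = [j for j in range(n) if j not in allowed]
        allowed.update(rng.sample(candidates, min(num_random, len(candidates))))
        pattern.append(sorted(allowed))
    return pattern
```

Each non-global token attends to at most `window + 1 + len(global_tokens) + num_random` positions, so the cost per layer still grows linearly with the sequence length.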
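All transformer models compute attention scores as follows: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, where Q, K and V are the query, key and value projections of the input and d_k is the dimension of the key vectors.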
Longformer/BigBird uses the following sets of projections:
- Qs, Ks, Vs to compute attention scores of sliding window
- Qg, Kg, Vg to compute attention scores of global window
This provides the flexibility to model the different types of attention separately. The most expensive operation is the matrix multiplication QK^T, as Q and K each have n projections (one per token of the input sequence). However, sliding window attention computes only a fixed number of diagonals of QK^T, and global attention only a fixed number of rows and columns, so memory usage grows linearly with the sequence length instead of quadratically as in full self-attention.
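The "fixed number of diagonals" idea can be illustrated with a small sketch that computes only the banded entries of QK^T for the sliding window part. It covers the sliding window projections only and is not the optimized kernel used by Longformer or BigBird; the name `banded_scores` and the dictionary representation are assumptions for this example.

```python
def banded_scores(Q, K, window):
    """Compute only the entries of Q K^T that lie on the central
    (window + 1) diagonals, stored as {(i, j): score}."""
    n = len(Q)
    half = window // 2
    band = {}
    for i in range(n):
        # Row i only needs the keys inside its window, not all n keys.
        for j in range(max(0, i - half), min(n, i + half + 1)):
            band[(i, j)] = sum(a * b for a, b in zip(Q[i], K[j]))
    return band


if __name__ == "__main__":
    for n in (6, 12, 24):
        vectors = [[float(i), 1.0] for i in range(n)]
        stored = len(banded_scores(vectors, vectors, window=4))
        print(n, "tokens:", stored, "banded scores vs", n * n, "dense scores")
```

For a window of 4, the band holds roughly 5 scores per token, so doubling the sequence length roughly doubles the stored scores, while the dense matrix quadruples.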
The development process for Precis primarily comprised the following steps:
The Precis app has 2 major components:
- Website - responsible for all the user interactions with the application such as choosing the input text type (in form of paragraph/document or website link), desired length of summary (i.e. the compression ratio) and the domain of the input text if required (news/medical/scientific)
- Cloud server - responsible for running the deep learning models and processing the input text as and when required.
For the frontend, I designed the UI (User Interface) screens using Flask. There are primarily three webpages as follows:
- Text summarization with paragraph input: This page lets users enter a paragraph of text, select the domain of the paragraph (a default domain is preselected in case the user is unsure) and choose the type of summary. In “Manual” mode, which performs extractive summarization, the user also selects the desired length of the summary. In “Automatic” mode the length cannot be selected, because that mode performs abstractive summarization with the trained model. “Manual” mode can summarize up to 4K tokens and “Automatic” mode up to 8K tokens.
- Text summarization with document upload: This page lets users upload a document of up to 8K tokens, select the type of summary and, for “Manual” summarization, the length of the summary. As above, the length cannot be selected in “Automatic” mode, which performs abstractive summarization with the trained model. The model performs best on scientific research papers.
- Text summarization with a website link: This page lets users provide the input text as a website link. It can process up to 4K tokens and is not domain specific. The model performs extractive summarization on the text retrieved from the website.
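The rules for the three pages can be summarized as a small validation helper. This is a hypothetical sketch, not the actual Precis backend: the names `MAX_TOKENS`, `DOMAINS` and `validate_request` are made up, "4K" and "8K" are assumed to mean 4096 and 8192, and tokens are approximated by whitespace-separated words rather than by the model tokenizer.

```python
# Assumed limits: "4K" = 4096 and "8K" = 8192 tokens.
MAX_TOKENS = {
    ("paragraph", "manual"): 4096,
    ("paragraph", "automatic"): 8192,
    ("document", "manual"): 8192,
    ("document", "automatic"): 8192,
    ("link", "manual"): 4096,  # website links support extractive summaries only
}
DOMAINS = {"default", "news", "medical", "scientific"}


def validate_request(input_type, mode, text, domain="default", compression_ratio=None):
    """Check a summarization request against the page rules and return a plan."""
    limit = MAX_TOKENS.get((input_type, mode))
    if limit is None:
        raise ValueError(f"unsupported combination: {input_type!r} / {mode!r}")
    if domain not in DOMAINS:
        raise ValueError(f"unknown domain: {domain!r}")
    if input_type == "link" and domain != "default":
        raise ValueError("website link input is not domain specific")
    if mode == "automatic" and compression_ratio is not None:
        raise ValueError("summary length cannot be chosen in Automatic mode")
    if compression_ratio is not None and not 0 < compression_ratio <= 1:
        raise ValueError("compression ratio must be in (0, 1]")
    # Assumption: whitespace-separated words stand in for model tokens.
    token_count = len(text.split())
    if token_count > limit:
        raise ValueError(f"input has {token_count} tokens, limit is {limit}")
    return {
        "summary_type": "extractive" if mode == "manual" else "abstractive",
        "domain": domain,
        "token_count": token_count,
        "compression_ratio": compression_ratio,
    }
```

Keeping the limits in a single table makes it easy to adjust them if a differently sized model is deployed.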
For the backend, I designed the database framework and the APIs that will help interact with the server. Further, I saved the models that I finetuned to specific datasets as pickle files which are loaded for processing the user’s input data when required.
In the above diagram, “Upload” depicts the Storage in Cloud Datastore where all the files uploaded by the user for summarization are stored.
The following are key technical features and models that have been implemented in the app:
- Finetuned BERT on CNN/Dailymail dataset for 4K tokens
- Finetuned Longformer Encoder-Decoder (LED) on CNN/Dailymail dataset for 8K tokens
- Finetuned BigBird on CNN/Dailymail dataset for 8K tokens
- Finetuned BERT on Arxiv dataset for 4K tokens
- Finetuned Longformer Encoder-Decoder (LED) on Arxiv dataset for 16K tokens
- Finetuned BERT on Pubmed dataset for 4K tokens
- Finetuned Longformer Encoder-Decoder (LED) on Pubmed dataset for 8K tokens
- Finetuned BigBird Pegasus on Pubmed dataset for 4K tokens
What went right… and why
Based on the results obtained from the finetuning process, the above models were finally used in the application.
The initial idea of the project was to design an online summarization application that could help summarize long texts. In my view, the primary goal was accomplished with basic features within the timeline. The Precis app is now able to:
- Take input as a paragraph, a document upload or a website link
- Summarize long texts of 4K-8K tokens in a matter of seconds
- Take a compression ratio as input from the user and produce a result accordingly
- Summarize different texts with greater accuracy based on the domain selected
- Perform both extractive and abstractive summarization based on the user’s selection
To summarize, I think the major challenge of the project was exploring and making use of different models developed under different frameworks to work together as a whole. This was in turn successful due to a number of continuous small-scale tests performed throughout the development process.
What went wrong… and why
Certain decisions made from the app’s design perspective went wrong:
- Although the finetuned models performed well in terms of ROUGE scores, their inference time for extractive summarization was too long for an online tool. The long delay was due to the model size. Hence, I had to replace the extractive model with a lighter one (i.e. TextRank) in the final deployment.
- Due to their large size, the models are still computationally expensive to run on cloud servers. The deployment zip file of the project is 1.2 GB, which is nearly double the size that popular staging servers such as Heroku and Google Cloud allow in their free tier.
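For readers unfamiliar with TextRank, the following is a minimal pure-Python sketch of the general technique: sentences are ranked with a PageRank-style iteration over a word-overlap similarity graph, and the top fraction is kept. It is not the implementation deployed in Precis; the function names, the regex-based sentence splitter, the default `ratio` and the other parameters are assumptions for this illustration.

```python
import math
import re


def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(a, b):
    # Word-overlap similarity normalised by the log of the sentence sizes.
    wa = set(re.findall(r"\w+", a.lower()))
    wb = set(re.findall(r"\w+", b.lower()))
    if not wa or not wb:
        return 0.0
    denom = math.log(len(wa)) + math.log(len(wb))
    return len(wa & wb) / denom if denom > 0 else 0.0


def textrank_summary(text, ratio=0.3, damping=0.85, iterations=50):
    """Return an extractive summary keeping about `ratio` of the sentences."""
    sentences = split_sentences(text)
    n = len(sentences)
    if n == 0:
        return ""
    sim = [
        [similarity(sentences[i], sentences[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_weight = [sum(row) for row in sim]
    scores = [1.0] * n
    for _ in range(iterations):
        # Weighted PageRank update over the sentence similarity graph.
        scores = [
            (1 - damping)
            + damping
            * sum(sim[j][i] / out_weight[j] * scores[j] for j in range(n) if out_weight[j] > 0)
            for i in range(n)
        ]
    keep = max(1, round(n * ratio))
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)[:keep]
    # Emit the selected sentences in their original order.
    return " ".join(sentences[i] for i in sorted(ranked))
```

Because it needs no trained weights, a ranker like this runs quickly and adds almost nothing to the deployment size, which is the trade-off that motivated the switch.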
Precis can be further extended to new applications of neural networks. Some novel use cases for extending this work are the following:
- Quick summary of Terms and Conditions, GDPR, or Privacy Policies of apps and websites for consumers
- Summarization of large Twitter Threads via a Twitter Bot
- Scan and translate/summarize physical documents or signage (via OCR/live text APIs), e.g. museum description signage
- Summarization of financial earnings reports of public companies
- Recap for digital books and audiobooks (similar to recap for TV shows)
- Better email newsletter plugin for publishers (e.g. send a summary of posts vs. a generic trimmed first paragraph)
- Medical and insurance forms and claims’ summary
- Diagnosis aid for physicians (e.g. summarize notes and suggest conditions or next steps for physicians based on the diagnosis)
- The LongT5 model can be used to process even longer texts.
- SciBERTSUM: Extractive Summarization for Scientific Documents
- Sparse is enough in scaling Transformers: Terraformer
- Fine-tune BERT for Extractive Summarization
- PRIMERA: Pyramid-based Masked Sentence Pre-training for Multi-document summarization