News summary app with python


SOURCE: TOWARDSDATASCIENCE.COM
SEP 12, 2021

Photo by Roman Kraft on Unsplash

Build a pure Python app using Streamlit, sumy, and News API

There’s a lot of news out there, and it’s hard to keep up. Since I’m not a speed reader, I was curious if I could find a pure Python solution to help me stay informed. I’ve used News API in the past, and I use Streamlit a lot for data science projects. After a quick search, I found sumy (Github), and it turned out to be a good solution for the text summarization element of the problem.

Assumptions

This article assumes some working knowledge of Python, Terminal, and APIs. Even if you’re unfamiliar with any of these topics, I think you’ll find most of it pretty manageable and fun.

Install Libraries

pip install sumy

pip install streamlit

Sumy uses NLTK, which requires additional data to be downloaded, and you can find detailed instructions here.

This is how I did it on macOS:

Run the following code in a Python session:

This will launch the NLTK Downloader. For central installation, change the Download Directory text field to: /usr/local/share/nltk_data. If central installation causes permission errors, it’s fine to install the data elsewhere, but you will need to set the environment variable.

To add the environment variable using zsh, you can run the following command in Terminal:

echo "export NLTK_DATA= >> ~.zshenv"

Get Started With sumy

Here are the import statements you will need to run the code:

from sumy.parsers.html import HtmlParser

from sumy.nlp.tokenizers import Tokenizer

from sumy.summarizers.lsa import LsaSummarizer as Summarizer

from sumy.nlp.stemmers import Stemmer

from sumy.utils import get_stop_words

import requests

news_imports.py hosted with ? by GitHub

The below function is a modified version of the example on the sumy Github page. The function simply ranks the sentences in the text, and returns a str with the most important sentences. The sentences_count argument specifies the max number of sentences in summary. While sumy can be used with plain text files, this tutorial will only cover HTML inputs.

def summarize_html(url: str, sentences_count: int, language: str = 'english') -> str:
"""
Summarizes text from URL

Inputs
----------
url: URL for full text
sentences_count: specifies max number of sentences for return value
language: specifies language of text

Return
----------
summary of text from URL
"""
parser = HtmlParser.from_url(url, Tokenizer(language))
stemmer = Stemmer(language)
summarizer = Summarizer(stemmer)
summarizer.stop_words = get_stop_words(language)

summary = ''
for sentence in summarizer(parser.document, sentences_count):
if not summary:
summary += str(sentence)
else:
summary += ' ' + str(sentence)

return summary

summarize_html_example.py hosted with ? by GitHub

The output shows the top 10 most important sentences in the article, which acts as a summary.

Here’s an example using the Wikipedia entry for Automatic summarization per the sumy Github page:

url = 'https://en.wikipedia.org/wiki/Automatic_summarization'
summarize_html(url, 10)

Get Started With News API

First, you need sign up and get an API key, which is easily done here . After signing up, an API key will automatically get generated and shown on the next screen. While there is a Python client library, I opted to create my own functions since it’s a straight forward API.

This article will cover two endpoints from the API:

  1. /v2/everything: Retrieves articles from archive of stories(documentation)
  2. /v2/top-headlines: Retrieves top headlines and breaking stories(documentation)

def news_api_request(url: str, **kwargs) -> list:
"""
Sends GET request to News API endpoint

Inputs
----------
url: full URL for endpoint
kwargs: please refer to
News API documentations:
https://newsapi.org/docs/endpoints/
(apiKey argument is required)

Return
----------
list containing data for each article in response
"""
params = kwargs
res = requests.get(url, params=params)
articles = res.json().get('articles')
return articles

news_api_request_function.py hosted with ? by GitHub

def summarize_news_api(articles: list, sentences_count: int) -> list:
"""
summarizes text at URL for each element of articles dict
(return value from news_api_request) and adds a new element
articles dict where the key is 'summary' and the value is
the summarized text

Inputs
----------
articles: list of dict returned from news_api_request()
sentences_count: specifies max number of sentences for
return value

Return
----------
articles list with summary element added to each dict
"""
for article in articles:
summary = summarize_html(article.get('url'), sentences_count)
article.update({'summary': summary})

return articles

summarize_news_api_function.py hosted with ? by GitHub

The summarize_news_api function iterates through each article, and passes the article’s URL to the summarize_html function. Each dict in articles gets updated with a new element for the summary.

The following code snippet is an example of using news_api_request and summarize_news together, to summarize the articles in the response from /v2/top-headlines endpoint.

url = 'https://newsapi.org/v2/top-headlines/'
articles = news_api_request(url, apiKey=api_key, sortBy='publishedAt', country='us')
summaries = summarize_news_api(articles)
print(summaries[0])

news_api_request_example.py hosted with ? by GitHub

If you run the above code, you will print a dict, representing the first article, with key/value pairs in the format shown below:

{'source': {'id': None, 'name': None},
'author': '',
'title': '',
'description': '',
'url': '',
'urlToImage': '',
'publishedAt': '',
'content': '',
'summary': ''}

news_api_request_response.py hosted with ? by GitHub

App Specs

Now that you have learned how to pull data from News API, and summarize the text, it’s time to build an app that meets the following criteria:

  1. Give users the ability to search for top stories by category. This will use the /v2/top-headlines endpoint from News API. The user will be able to choose from a list of categories as specified in the News API documentation.
  2. Give users the ability to search over all articles by search term, which the user will provide by the way of text input. This will use the /v2/everything endpoint from News API
  3. Give users the ability to specify the max length for the summaries

Get Started With Streamlit

Streamlit is a super intuitive front end solution for Python users. It lends itself well to creating prototypes and simple tools, which is why I thought it would be great for this project.

Streamlit will be introduced by building the front end of the app specified in the previous section.

Create a title

import streamlit as st

st.title('News Summarizer')

Run the app:

  1. Save the file as news_app.py
  2. Navigate to the folder in Terminal
  3. Run in Terminal: streamlit run news_app.py

A browser window should pop up, and look like this:

https://miro.medium.com/max/700/1*8dLOhshzpr36k7Q7SoJ_UA.png

Add some user input features

import streamlit as st

st.title('News Summarizer')

# Gives option between top stories and search term
search_choice = st.sidebar.radio('', options=['Top Headlines', 'Search Term'])

# Takes user input for max number of sentences per summary
sentences_count = st.sidebar.slider('Max sentences per summary:', min_value=1,
max_value=10,
value=3)

news_app_1.py hosted with ? by GitHub

Line 6 creates radio buttons to allow users to select which endpoint to search. The first argument is a label for the radio button cluster, which I left as a blank str since the ratio buttons are descriptive enough for this app. The second argument lets you pass the options for radio buttons as a list. The user input is then saved to search_choice as a str, which makes it simple to use throughout the app. In this example, the first element of the list is the default value.

Line 9 creates a slider input widget, which gives users the ability to specify the length of the summaries. Again, the first argument is a label for the widget. The remaining arguments define the range of the slider, and the default value. The user input is saved to sentences_count as an int.

When you run the app, you should see something like this:

https://miro.medium.com/max/700/1*JrHF44zSgSK7heM4jjNW-A.png

As you can see, the widgets appear on a collapsable sidebar. If you replace st.sidebar.radio with st.radio, the widget will appear under the title.

Add conditional features based on user input

The next features will depend on whether the user selects ‘Top Headlines’ or ‘Search Term’. If the ‘Top Headlines’ is selected, the user needs to have the option to select a category. If ‘Search Term’ is selected, the user will need to be able to enter a search term.

import streamlit as st

st.title('News Summarizer')

# Gives option between top stories and search term
search_choice = st.sidebar.radio('', options=['Top Headlines', 'Search Term'])

# Takes user input for max number of sentences per summary
sentences_count = st.sidebar.slider('Max sentences per summary:', min_value=1,
max_value=10,
value=3)

# If user selects 'Top Headlines', a category dropdown will appear
if search_choice == 'Top Headlines':
category = st.sidebar.selectbox('Category:', options=['business',
'entertainment',
'general',
'health',
'science',
'sports',
'technology'], index=2)

st.write(f'you will see articles about {category} here')
# This is where we will call get_top_headlines()

# If user selects 'Search Term', a text input box will appear
elif search_choice == 'Search Term':
search_term = st.sidebar.text_input('Enter Search Term:')

if not search_term:
st.write('Please enter a search term =)')
else:
st.write(f'you will see articles about {search_term} here')
# This is where you will call search_articles()

streamlit_news_widgets.py hosted with ? by GitHub

As you can see, the flow of the app is defined by the value of search_choice, which is used in the conditionals on lines 14 and 27. If the user selects ‘Top Headlines’, a dropdown will appear with options for categories. The options passed to st.sidebar.selectbox come from the NewsAPI documentation. The index argument specifies which element of the list to use as default, and I thought ‘general’ made sense.

If the user selects ‘Search Term’, a text box will appear to take user input. I added a conditional statement to prompt user for input if the text input box is empty. This helps to avoid errors when the API calls are added.

You should see something like this when you run the app:

https://miro.medium.com/max/700/1*gQbujTDwIF6eh5U5ExeZRQ.gif

Live preview of News Summarizer

This shows you the raw shell of the app, and it appears to be meeting the criteria set up above. Now you just need to incorporate the functions from earlier to hit News API and summarize the text.

Create separate file to store functions

I like to use a separate file for functions so that the Streamlit app stays clean. The below snippet shows five functions needed for the app. The first three functions, summarize_html ,news_api_request, and summarize_news_api are the same ones we used above. I added two new functions, search_articles and get_top_headlines, which both pass hardcoded endpoints to the news_api_request function to help simplify the code.

from sumy.parsers.html import HtmlParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer as Summarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
import requests


def summarize_html(url: str, sentences_count: int, language: str = 'english') -> str:
"""
Summarizes text from URL

Inputs
----------
url: URL for full text
sentences_count: specifies max number of sentences for return value
language: specifies language of text

Return
----------
summary of text from URL
"""
parser = HtmlParser.from_url(url, Tokenizer(language))
stemmer = Stemmer(language)
summarizer = Summarizer(stemmer)
summarizer.stop_words = get_stop_words(language)

summary = ''
for sentence in summarizer(parser.document, sentences_count):
if not summary:
summary += str(sentence)
else:
summary += ' ' + str(sentence)

return summary


def news_api_request(url: str, **kwargs) -> list:
"""
Sends GET request to News API endpoint

Inputs
----------
url: full URL for endpoint
kwargs: please refer to
News API documentations:
https://newsapi.org/docs/endpoints/
(apiKey argument is required)

Return
----------
list containing data for each article in response
"""
params = kwargs
res = requests.get(url, params=params)
articles = res.json().get('articles')
return articles


def summarize_news_api(articles: list, sentences_count: int) -> list:
"""
summarizes text at URL for each element of articles dict
(return value from news_api_request) and adds a new element
articles dict where the key is 'summary' and the value is
the summarized text

Inputs
----------
articles: list of dict returned from news_api_request()
sentences_count: specifies max number of sentences for
return value

Return
----------
articles list with summary element added to each dict
"""
for article in articles:
summary = summarize_html(article.get('url'), sentences_count)
article.update({'summary': summary})

return articles


def search_articles(sentences_count: int, **kwargs) -> list:
"""
Sends GET request to News API /v2/everything endpoint,
and summarizes data at each URL

Inputs
----------
sentences_count: specifies max number of sentences
for return value
kwargs: see News API
documentation:
https://newsapi.org/docs/endpoints/everything

Return
----------
list where each element is a dict containing info about a single article
"""
url = 'https://newsapi.org/v2/everything/'
articles = news_api_request(url, **kwargs)
return summarize_news_api(articles, sentences_count)


def get_top_headlines(sentences_count: int, **kwargs) -> list:
"""
Sends GET request to News API /v2/top-headlines endpoint,
and summarizes data at each URL

Inputs
----------
sentences_count: specifies max number of sentences for return value
kwargs: see News API
documentation:
https://newsapi.org/docs/endpoints/top-headlines

Return
----------
list where each element is a dict containing info
about a single article
"""
url = 'https://newsapi.org/v2/top-headlines/'
articles = news_api_request(url, **kwargs)
return summarize_news_api(articles, sentences_count)

app_functions.py hosted with ? by GitHub

Finish Streamlit app

import os
from app_functions import get_top_headlines, search_articles
import streamlit as st

API_KEY = os.environ['NEWS_API_KEY']

st.title('News Summarizer')

search_choice = st.sidebar.radio('', options=['Top Headlines', 'Search Term'])
sentences_count = st.sidebar.slider('Max sentences per summary:', min_value=1,
max_value=10,
value=3)

if search_choice == 'Top Headlines':
category = st.sidebar.selectbox('Search By Category:', options=['business',
'entertainment',
'general',
'health',
'science',
'sports',
'technology'], index=2)

summaries = get_top_headlines(sentences_count, apiKey=API_KEY,
sortBy='publishedAt',
country='us',
category=category)


elif search_choice == 'Search Term':
search_term = st.sidebar.text_input('Enter Search Term:')

if not search_term:
summaries = []
st.write('Please enter a search term =)')
else:
summaries = search_articles(sentences_count, apiKey=API_KEY,
sortBy='publishedAt',
q=search_term)

for i in range(len(summaries)):
st.title(summaries[i]['title'])
st.write(f"published at: {summaries[i]['publishedAt']}")
st.write(f"source: {summaries[i]['source']['name']}")
st.write(summaries[i]['summary']))

news_summarizer_app.py hosted with ? by GitHub

In the above code, line 2 imports the functions from app_function.py. On line 23, get_top_headlines is called, which calls the /v2/everything endpoint. This corresponds to the user selecting ‘Top Headlines’ for search_choice.

On line 36, search_articles is called, which calls the /v2/everything endpoint. This corresponds to the user selecting ‘Search Term’ for search_choice.

On line 40, the app will loop through summaries, and write out selected information that the app needs to display. In this example, the app will display title, date/time of publication, source name, and the summary that we generated in the preceding steps.

Finally, you can run the finished app:

streamlit run news_summarizer_app.py

You can see the full project here.

Future Enhancements

I think this app is a pretty good start, but there are a number of ways to make it better. Here are a few improvements I thought about:

  1. Look for alternative methods for text summarization
  2. Incorporate sentiment analysis
  3. Create more user input fields in the app so users can better customize queries
  4. Incorporate several APIs to get more articles

Conclusion

You just learned how to build a basic news summarization app in Python, using Streamlit, News API, and sumy. I hope this gives you some ideas for new projects. Thanks for reading.

Similar articles you can read