
Video Summarization Using OpenAI Whisper & Hugging Chat API


Introduction

“Less is more,” as architect Ludwig Mies van der Rohe famously said, and that is what summarization is about. Summarization is an essential tool for reducing voluminous text into succinct, relevant morsels, suited to today’s fast-paced information consumption. In text applications, summarization aids information retrieval and supports decision-making. The integration of Generative AI, such as OpenAI’s GPT-3-based models, has revolutionized this process by not only extracting key elements from text but also generating coherent summaries that retain the source’s essence. Interestingly, Generative AI’s capabilities extend beyond text to video summarization. This involves extracting pivotal scenes, dialogues, and concepts from videos to create abridged representations of the content. Video summarization can be achieved in many different ways, including generating a short summary video, performing video content analysis, highlighting key sections of the video, or creating a textual summary of the video from its transcription.

The OpenAI Whisper API leverages automatic speech recognition technology to convert spoken language into written text, thereby increasing the accuracy and efficiency of text summarization. The Hugging Face Chat API, on the other hand, provides access to state-of-the-art open-source language models.

Learning Objectives

In this article we will:

  • Learn about video summarization techniques
  • Understand the applications of video summarization
  • Explore the OpenAI Whisper model architecture
  • Implement textual video summarization using the OpenAI Whisper and Hugging Chat APIs

This article was published as a part of the Data Science Blogathon.

Video Summarization Techniques

Video Analytics

Video analytics is the process of extracting meaningful information from a video. Deep learning is used to track and identify objects and actions in a video and to identify the scenes. Some of the popular techniques for video summarization are:

Keyframe Extraction and Shot Boundary Detection

This process involves reducing the video to a limited number of still images. The resulting shorter video of keyshots is also called a video skim.

Video shots are uninterrupted, continuous sequences of frames. Shot boundary detection identifies transitions between shots, such as cuts, fades, or dissolves, and chooses frames from each shot to build a summary. Below are the major steps to extract a continuous short video summary from a longer video (a minimal code sketch follows the list):

  • Frame Extraction – Snapshots are extracted from the video; for example, we can sample 1 fps from a 30 fps video.
  • Face and Emotion Detection – We then extract faces from the sampled frames and score their emotions to obtain emotion scores. Face detection can be performed with an SSD (Single Shot Multibox Detector).
  • Frame Scoring & Selection – Select frames that have a high emotion score and then rank them.
  • Final Extraction – We extract subtitles from the video along with timestamps. We then extract the sentences corresponding to the frames selected above, along with their start and end times in the video. Finally, we merge the video segments corresponding to these intervals to generate the final summary video.
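
As a rough illustration of the first three steps, here is a minimal sketch that samples a long video at roughly 1 fps and scores each sampled frame by the number of detected faces. It uses OpenCV’s bundled Haar cascade as a simple stand-in for the SSD face detector, and the raw face count as a stand-in for a real emotion score; the video path, sampling rate, and top-10 cutoff are illustrative assumptions rather than part of the original pipeline.

import cv2

def extract_keyframe_candidates(video_path: str, sample_fps: int = 1):
    # Sample roughly `sample_fps` frames per second and score each sampled frame
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30
    step = max(int(native_fps // sample_fps), 1)
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')

    scored_frames = []
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
            # Use the face count as a crude proxy for an emotion score
            scored_frames.append((index / native_fps, len(faces)))
        index += 1
    cap.release()

    # Return the ten highest-scoring timestamps as keyframe candidates
    return sorted(scored_frames, key=lambda t: t[1], reverse=True)[:10]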

Action Recognition and Temporal Subsampling

Here we try to identify the human actions performed in the video, which is a widely used application of video analytics. Instead of working frame by frame, we break the video down into small subsequences and estimate the action performed in each segment using classification and pattern-recognition techniques such as Hidden Markov Models (HMMs).
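
As a trivial sketch of the subsampling idea (the 16-frame segment length is an arbitrary assumption), a video represented as a list of frames can be split into fixed-length, non-overlapping subsequences that a downstream classifier, such as an HMM-based action recognizer, would then label:

def temporal_subsample(frames, segment_len: int = 16):
    # Split a frame sequence into non-overlapping segments for per-segment classification
    return [frames[i:i + segment_len] for i in range(0, len(frames), segment_len)]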

Single and Multi-modal Approaches

In this article we use a single-modal approach, in which we use the audio of the video to create a textual summary. That is, we take a single aspect of the video, its audio, convert it to text, and then generate a summary from that text.

In a multi-modal approach, we combine information from several modalities, such as audio, visuals, and text, to gain a holistic understanding of the video content for more accurate summarization.

Applications of Video Summarization

Before diving into the implementation of our video summarization, we should first look at its applications. Below are some examples of video summarization across a variety of fields and domains:

  • Security and Surveillance: Video summarization lets us analyze large amounts of surveillance footage and surface important events without manually reviewing the video.
  • Education and Training: Key notes and training-video highlights can be delivered so that students can revise the video content without going through the entire video.
  • Content Browsing: YouTube uses this to highlight the parts of a video relevant to a user’s search, allowing users to decide whether to watch that particular video based on their search requirements.
  • Disaster Management: In emergencies and crises, video summarization allows responders to act on situations highlighted in the video summary.

OpenAI Whisper Model Overview

OpenAI’s Whisper is an automatic speech recognition (ASR) model. It is used for transcribing speech audio into text.

Architecture of the OpenAI Whisper model

It is based on the transformer architecture, which stacks encoder and decoder blocks with an attention mechanism that propagates information between them. The model takes the audio recording, divides it into 30-second chunks, and processes each one independently. For each 30-second chunk, the encoder encodes the audio while preserving the position of each recognized word, and the decoder uses this encoded information to determine what was said.

The decoder predicts tokens from this information, which roughly correspond to each spoken word. It then repeats this process for the next word, using all of the same information to help it identify the next token that makes the most sense.
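
The snippet below, built from the openai-whisper package’s lower-level helpers, sketches this encode/decode flow for a single 30-second window; the audio path is an assumption, and the higher-level transcribe() call used later in this article handles the chunking loop for you.

import whisper

model = whisper.load_model("tiny")

# Load the audio and pad/trim it to exactly 30 seconds, the window the model expects
audio = whisper.load_audio("audio/sample.m4a")
audio = whisper.pad_or_trim(audio)

# Compute the log-Mel spectrogram that the encoder consumes
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# The decoder predicts tokens (roughly one per spoken word) conditioned on the
# encoder output and the tokens it has already emitted
options = whisper.DecodingOptions(fp16=False)
result = whisper.decode(model, mel, options)
print(result.text)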

Whisper model task flowchart

Coding Example for Video Textual Summarization

Flowchart of textual video summarization

1 – Install and Load Libraries

!pip install yt-dlp openai-whisper hugchat
import yt_dlp
import whisper
from hugchat import hugchat

2 – Download audio from the YouTube video

# Function for saving the audio track of a YouTube video, given its video id
def download(video_id: str) -> str:
    video_url = f'https://www.youtube.com/watch?v={video_id}'
    ydl_opts = {
        'format': 'm4a/bestaudio/best',
        'paths': {'home': 'audio/'},
        'outtmpl': {'default': '%(id)s.%(ext)s'},
        'postprocessors': [{
            'key': 'FFmpegExtractAudio',
            'preferredcodec': 'm4a',
        }]
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        error_code = ydl.download([video_url])
        if error_code != 0:
            raise Exception('Failed to download video')

    return f'audio/{video_id}.m4a'


# Call the function with the video id (pass only the id, not URL parameters such as &t=99s)
file_path = download('A_JQK_k4Kyc')

3 – Transcribe audio to text using Whisper

# Load the Whisper model
whisper_model = whisper.load_model("tiny")

# Transcribe audio function
def transcribe(file_path: str) -> str:
  # `fp16` defaults to `True`; set it to False to run in fp32, which avoids a warning on CPU.
  transcription = whisper_model.transcribe(file_path, fp16=False)
  return transcription['text']


# Call the transcriber function with the path of the downloaded audio
transcript = transcribe(file_path)
print(transcript)

4 – Summarize transcribed text using Hugging Chat

Note that to use the Hugging Chat API we need to log in or sign up on the Hugging Face platform. We then pass our Hugging Face credentials in place of “username” and “password”.

from hugchat.login import Login

# Log in to Hugging Face
sign = Login("username", "password")
cookies = sign.login()
sign.saveCookiesToDir("/content")

# Load cookies from the saved user cookies
cookies = sign.loadCookiesFromDir("/content")  # Detects whether the JSON cookie file exists; returns the cookies if it does and raises an Exception if it does not.

# Create a ChatBot
chatbot = hugchat.ChatBot(cookies=cookies.get_dict())  # or cookie_path="usercookies/<email>.json"
print(chatbot.chat("Hi!"))

# Summarize the transcript
print(chatbot.chat('''Summarize the following :-''' + transcript))
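
For long videos the transcript may not fit in the chat model’s context window. A minimal sketch of one workaround is to summarize fixed-size chunks first and then ask the model to merge the partial summaries; the 6000-character chunk size is an arbitrary assumption, and str() is applied to each reply since hugchat may return a message object rather than a plain string.

# Summarize long transcripts chunk by chunk, then merge the partial summaries
def summarize_long_transcript(chatbot, text: str, chunk_chars: int = 6000) -> str:
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    partial_summaries = [
        str(chatbot.chat('Summarize the following :-\n' + chunk)) for chunk in chunks
    ]
    if len(partial_summaries) == 1:
        return partial_summaries[0]
    return str(chatbot.chat(
        'Combine these partial summaries into one concise summary :-\n'
        + '\n'.join(partial_summaries)))

print(summarize_long_transcript(chatbot, transcript))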

Conclusion

In conclusion, summarization is a transformative force in information management. It is a powerful tool that distills voluminous content into concise, meaningful forms, tailored to today’s fast-paced consumption of information.

Through the integration of Generative AI models like OpenAI’s GPT-3, summarization has transcended its traditional boundaries, evolving into a process that not only extracts but also generates coherent and contextually accurate summaries.

The journey into video summarization reveals its relevance across numerous sectors. The implementation shows how audio extraction, transcription using Whisper, and summarization via Hugging Face Chat can be seamlessly integrated to create textual video summaries.

Key Takeaways

1. Generative AI: Video summarization can be achieved using generative AI technologies such as LLMs and ASR.

2. Applications in the Field: Video summarization is genuinely useful in many important fields where one has to analyze large amounts of video to mine crucial information.

3. Basic Implementation: In this article we explored a basic code implementation of video summarization based on the audio dimension.

4. Model Architecture: We also learned about the basic architecture of the OpenAI Whisper model and its process flow.

Frequently Asked Questions

Q1. What are the limits of the Whisper API?

A. The Whisper API is limited to 50 calls per minute. There is no audio-length limit, but only files up to 25 MB can be uploaded. One can reduce the size of an audio file by lowering its bitrate, as sketched below.
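
For example, a minimal sketch of re-encoding the audio at a lower bitrate with ffmpeg (ffmpeg must be installed and on the PATH; the file names and the 64 kbps target are illustrative assumptions):

import subprocess

# Re-encode the audio at 64 kbps to shrink the file below the 25 MB upload limit
subprocess.run(
    ['ffmpeg', '-i', 'audio/A_JQK_k4Kyc.m4a', '-b:a', '64k', 'audio/A_JQK_k4Kyc_64k.m4a'],
    check=True,
)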

Q2. Which file formats does the Whisper API support?

A. The following file formats: m4a, mp3, webm, mp4, mpga, wav, and mpeg.

Q3. What are the alternatives to the Whisper API?

A. Some of the major alternatives for automatic speech recognition are Twilio Voice, Deepgram, Azure Speech-to-Text, and Google Cloud Speech-to-Text.

Q4. What are the limitations of Automatic Speech Recognition (ASR) systems?

A. Among the challenges are comprehending diverse accents of the same language and the need for specialized training when working in specialized fields.

Q5. What are the alternatives to Automatic Speech Recognition (ASR)?

A. Advanced research is ongoing in the field of speech recognition, such as decoding imagined speech from EEG signals using neural architectures. This allows people with speech disabilities to communicate their thoughts to the external world with the help of devices. One such fascinating paper is linked here.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
