CliffsNotes your voicemails with WebSockets, Twilio, and OpenAI

By Joel Hans
October 2, 2024
While ngrok started as the way to get a local service online in a single line, it’s evolved to be much more than that (say, a universal gateway used by 80% of the Cloud 100!).
Today, let’s hearken back to why ngrok was created: testing your local services against things outside localhost.
Just before the first session of Office Hours went live, I got a fantastic question:
How can I set up secure WebSockets with ngrok to use the Twilio [Media Stream] API?
First off, let’s clarify one thing about WebSockets, since we get questions about it now and then: ngrok not only supports WebSockets (WS) via default HTTP tunnels, but also secures them by doing all the messy work around TLS certificates and termination for you.
You can securely expose your service to Twilio’s API (or any other) for development and testing, and when you’re ready to deliver to prod, you don’t need to change your configuration—add a custom domain and a few Traffic Policy rules if you need extra security or rate limiting.
Here’s a quick walkthrough of the demo I wanted to do live.
Actually, that’s a lie.
I was going to showcase one of Twilio’s Media Streams Demos, but when it comes to publishing a version of that here, it suddenly felt lacking. I needed to spice things up a bit.
Taking inspiration from those quickstarts, especially the ones that connect phone audio to a text summarization service, I decided to modernize them and go one step further: create an API service that Twilio can POST to and then stream call audio to over WebSockets, so the audio can be transcribed and summarized with the help of OpenAI.
Here’s what I came up with.
require('dotenv').config();
const express = require('express');
const http = require('http');
const WebSocket = require('ws');
const path = require('path');
const twilio = require('twilio');
const { OpenAI, toFile } = require('openai');
const TwilioMediaStreamSaveAudioFile = require('twilio-media-stream-save-audio-file');
const fs = require('fs').promises;

const app = express();
const server = http.createServer(app);
const wss = new WebSocket.Server({ server });
const PORT = process.env.PORT || 3000;
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const mediaStreamSaver = new TwilioMediaStreamSaveAudioFile({ saveLocation: `${__dirname}/temp` });

wss.on('connection', (ws) => {
  ws.on('message', async (message) => {
    const { event, media } = JSON.parse(message);
    switch (event) {
      case 'start':
        console.log('Call connected. Starting media stream...');
        mediaStreamSaver.twilioStreamStart();
        break;
      case 'media':
        mediaStreamSaver.twilioStreamMedia(media.payload);
        break;
      case 'stop':
        console.log('Call ended.');
        mediaStreamSaver.twilioStreamStop();
        try {
          const transcription = await transcribeAudio();
          const summary = await summarizeText(transcription);
          console.log('Transcription:', transcription);
          console.log('Summary:', summary);
          ws.send(JSON.stringify({ transcription, summary }));
        } catch (error) {
          console.error('Error processing audio:', error);
          ws.send(JSON.stringify({ error: error.message }));
        }
        break;
    }
  });
});

async function transcribeAudio() {
  const audioData = await fs.readFile(mediaStreamSaver.wstream.path);
  const file = await toFile(audioData, path.basename(mediaStreamSaver.wstream.path), { type: 'audio/wav' });
  return openai.audio.transcriptions.create({
    file: file,
    model: 'whisper-1',
    response_format: 'text',
  });
}

async function summarizeText(transcription) {
  const response = await openai.chat.completions.create({
    model: "gpt-3.5-turbo",
    messages: [
      { role: "system", content: "You are a helpful assistant that summarizes text as succinctly as possible." },
      { role: "user", content: `Please summarize the following text in a single sentence: ${transcription}` }
    ],
  });
  return response.choices[0].message.content;
}

app.use(express.urlencoded({ extended: false }));

app.post('/twiml', twilio.webhook({validate: false}), (req, res) => {
  const twiml = new twilio.twiml.VoiceResponse();
  twiml.start().stream({ url: `wss://${req.headers.host}/message` });
  twiml.say('Please start speaking.');
  twiml.pause({ length: 30 });
  res.type('text/xml').send(twiml.toString());
});

server.listen(PORT, () => {
  console.log(`Server is running on port ${PORT}`);
});
I would’ve loved to comment this API so it’s completely self-explanatory, but in short, an Express server responds to POST requests to the /twiml route (TwiML being the Twilio Markup Language) by instructing Twilio to open a WS connection back to the server and telling it how to handle the phone call.
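If you're curious what that TwiML response looks like on the wire, here is a rough sketch that builds an equivalent XML document by hand, with no Twilio helper library involved. The function names (escapeXml, buildVoicemailTwiml) and the example hostname are my own placeholders for illustration, and the exact whitespace the real twilio library emits will differ.

```javascript
// Escape characters that are special in XML text and attribute values.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// Hypothetical helper: builds TwiML equivalent to what the /twiml route returns.
// <Start><Stream> asks Twilio to open a WebSocket to our server, <Say> prompts
// the caller, and <Pause> keeps the call open while audio streams.
function buildVoicemailTwiml(host, { prompt = 'Please start speaking.', pauseSeconds = 30 } = {}) {
  const streamUrl = `wss://${host}/message`;
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<Response>',
    '  <Start>',
    `    <Stream url="${escapeXml(streamUrl)}" />`,
    '  </Start>',
    `  <Say>${escapeXml(prompt)}</Say>`,
    `  <Pause length="${Number(pauseSeconds)}" />`,
    '</Response>',
  ].join('\n');
}

// Example hostname only; in the demo the host comes from req.headers.host.
console.log(buildVoicemailTwiml('twilio.example.ngrok.app'));
```

Because the stream URL is derived from the request's Host header, the same route works unchanged whether it is reached through an ngrok domain or something else.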
The very handy twilio-media-stream-save-audio-file project captures and decodes Twilio’s streaming audio (mysteriously encoded in… MULAW?), then saves it as a local .wav file.
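For anyone else puzzled by MULAW: Twilio's media messages carry base64-encoded, 8 kHz, single-channel μ-law (G.711) audio, where each byte expands to one 16-bit PCM sample. The sketch below shows the standard μ-law expansion in plain Node.js. It is not necessarily how twilio-media-stream-save-audio-file handles the audio internally, and the function names are hypothetical.

```javascript
// Expand one 8-bit mu-law byte into a signed 16-bit PCM sample (G.711).
function mulawToPcm16(byte) {
  const u = ~byte & 0xff;            // mu-law bytes are stored bit-inverted
  const sign = u & 0x80;             // top bit: sign
  const exponent = (u >> 4) & 0x07;  // next three bits: segment
  const mantissa = u & 0x0f;         // low four bits: step within the segment
  const magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84;
  return sign ? -magnitude : magnitude;
}

// Decode the base64 `media.payload` string from a Twilio media message
// into an Int16Array of PCM samples.
function decodeMediaPayload(base64Payload) {
  const mulaw = Buffer.from(base64Payload, 'base64');
  const pcm = new Int16Array(mulaw.length);
  for (let i = 0; i < mulaw.length; i++) {
    pcm[i] = mulawToPcm16(mulaw[i]);
  }
  return pcm;
}

// Three hand-picked bytes: silence, loudest positive, loudest negative.
const examplePayload = Buffer.from([0xff, 0x80, 0x00]).toString('base64');
console.log(Array.from(decodeMediaPayload(examplePayload))); // [ 0, 32124, -32124 ]
```

At 8,000 samples per second, one second of call audio is 8,000 μ-law bytes, which is why the format is popular in telephony despite looking odd next to modern codecs.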
That file then goes to OpenAI’s Whisper model for transcription, which we then pipe to ChatGPT for summarization.
ngrok operates as an API-gateway-in-development, tunneling Twilio’s traffic to my localhost.
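One caveat worth flagging: once the tunnel is up, the /twiml route is reachable by anyone who knows the URL, and the code above turns Twilio's request validation off with validate: false. As a rough sketch of what that validation involves, Twilio signs each webhook by taking the full request URL, appending every POST parameter name and value in alphabetical order of the names, computing an HMAC-SHA1 with your auth token, and sending the base64 result in the X-Twilio-Signature header. The helper names below are hypothetical; in a real project, the twilio library's webhook middleware is meant to do this for you when validation is enabled.

```javascript
const crypto = require('crypto');

// Compute the signature Twilio would send for a form-encoded POST webhook.
function computeTwilioSignature(authToken, url, params = {}) {
  // URL first, then each parameter name immediately followed by its value,
  // sorted by parameter name, with no separators.
  const data = Object.keys(params)
    .sort()
    .reduce((acc, key) => acc + key + params[key], url);
  return crypto.createHmac('sha1', authToken).update(Buffer.from(data, 'utf-8')).digest('base64');
}

// Compare the received header against the expected value in constant time.
function isValidTwilioSignature(authToken, signatureHeader, url, params) {
  const expected = Buffer.from(computeTwilioSignature(authToken, url, params));
  const received = Buffer.from(signatureHeader || '');
  return expected.length === received.length && crypto.timingSafeEqual(expected, received);
}
```

The URL has to match exactly what Twilio requested, including the https scheme and the public ngrok hostname, rather than whatever localhost address the Express server sees.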
Whew.
Here’s what I used to set the project up:
1. Find an available phone number with the Twilio CLI: twilio api:core:available-phone-numbers:local:list --country-code="US" --voice-enabled --properties="phoneNumber"
2. Claim that number: twilio api:core:incoming-phone-numbers:create --phone-number="+123456789"
3. Create a .env file with OPENAI_API_KEY and TWILIO_AUTH_TOKEN.
4. Run npm install to get dependencies.
5. Create a /temp directory for storing Twilio streams with mkdir temp.

And the demo itself:

1. Start the server with node server.js.
2. Start ngrok with ngrok http 3000 --url twilio.{NGROK_DOMAIN}.
3. Place the call: twilio api:core:calls:create --from="+123456789" --to="{MY_PHONE_NUMBER_PLEASE_DONT_ASK}" --url="https://{NGROK_DOMAIN}/twiml"
Here was the result:
Call connected. Starting media stream...
Call ended.
Transcription: It's September 27th today, and here in Tucson, Arizona, it's still over 100 degrees. I think it's 105 today, and it's supposed to be, you know, it's almost the end of September. It's almost October. It's fall. The average high for this time of year is supposed to be in the low 90s, I think. So we're talking 10 degrees plus where it's supposed to be. It's just totally unfair. That's all.
Summary: In Tucson, Arizona on September 27th, the temperature is unseasonably hot at over 100 degrees, which is about 10 degrees higher than the typical average for this time of year.
Crude? Yes. A great example of using ngrok to secure WS from a local webserver and make them accessible to a public API or service like Twilio? Absolutely.
And in ngrok’s Traffic Inspector, you can also see how the existing HTTP connection gets a Connection: Upgrade header as it switches over to WebSockets, with ngrok securing your WS implementation without any extra configuration.
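To make that upgrade a little less magical, here is a small sketch of the server side of the WebSocket opening handshake as defined in RFC 6455: the client sends Connection: Upgrade, Upgrade: websocket, and a random Sec-WebSocket-Key, and the server proves it understood by hashing that key with a fixed GUID. The ws package does all of this for you; the function names here are placeholders of mine.

```javascript
const crypto = require('crypto');

// Fixed GUID defined by RFC 6455 for the opening handshake.
const WS_GUID = '258EAFA5-E914-47DA-95CA-C5AB0DC85B11';

// Derive the Sec-WebSocket-Accept response header from the client's key.
function computeAcceptKey(secWebSocketKey) {
  return crypto.createHash('sha1').update(secWebSocketKey + WS_GUID).digest('base64');
}

// Check whether lowercased request headers (as Node exposes them) ask for a
// WebSocket upgrade. Connection may list several comma-separated tokens.
function isWebSocketUpgrade(headers) {
  const tokens = String(headers.connection || '')
    .toLowerCase()
    .split(',')
    .map((token) => token.trim());
  const upgrade = String(headers.upgrade || '').toLowerCase();
  return tokens.includes('upgrade') && upgrade === 'websocket';
}

// The sample key from RFC 6455, section 1.3.
console.log(computeAcceptKey('dGhlIHNhbXBsZSBub25jZQ==')); // s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
```

The server answers with HTTP 101 Switching Protocols plus that accept value, and from then on the same TCP connection carries WebSocket frames.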
This project was a perfect example of the use case that got ngrok started more than 10 years ago: exposing local services (WebSockets and beyond) to public webhooks and APIs. Imagine how painful this would have been if I had to push my WS server to a production system after every change.
ngrok undoubtedly sped up the pace of my development process, but it’s also expanded far beyond that founding use case of webhook testing and tunneling to localhost: I could just as quickly and easily go live in production without a single change to how I use ngrok.
Maybe I’d just add some Traffic Policy magic with request variables and CEL?
These are the kinds of use cases and paths to prod we’ll continue exploring in the next session of Office Hours. I’d love for you to join us! When you register for the next livestream, please ask your question in advance—these chats are yours to shape, and I can only craft demos like these if I know my goalposts ahead of time.
In the meantime, if you’ve been meaning to start developing a new app alongside Twilio or any other external API, give ngrok a try with a free account. This need to expose local services certainly hasn’t gone away, and probably never will, and even after all these years, ngrok remains your app’s simplest and most secure front door.