Implementing Speech-to-Text (Dictation) in iOS Applications
Learn how to integrate powerful speech recognition capabilities into your iOS apps using Apple's Speech framework for dictation and voice commands.
Speech-to-text, often referred to as dictation, has become a fundamental feature in modern mobile applications, enhancing accessibility and user experience. iOS provides a robust Speech framework that allows developers to easily incorporate speech recognition into their apps. This article will guide you through setting up and using the Speech framework to enable dictation, handle permissions, and process spoken input in real time.
Understanding the iOS Speech Framework
The Speech framework, introduced in iOS 10, provides APIs for converting spoken audio into text. It supports both live audio input from the device's microphone and pre-recorded audio files, and it handles a wide range of languages and dialects with high accuracy and performance. Key components include SFSpeechRecognizer for performing recognition, SFSpeechAudioBufferRecognitionRequest for live audio, and SFSpeechRecognitionTask for managing an in-flight recognition.
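This article focuses on live microphone input, but file-based recognition is a useful reference point for these components. Here is a minimal sketch using SFSpeechURLRecognitionRequest to transcribe a pre-recorded file; the function name and en-US locale are illustrative assumptions:

import Speech

// Minimal sketch: transcribe a pre-recorded audio file.
// transcribeFile and the en-US locale are assumptions for illustration.
func transcribeFile(at url: URL) {
    guard let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US")),
          recognizer.isAvailable else {
        print("Speech recognizer unavailable")
        return
    }
    let request = SFSpeechURLRecognitionRequest(url: url)
    recognizer.recognitionTask(with: request) { result, error in
        if let result = result, result.isFinal {
            // Full transcription of the file once recognition completes.
            print(result.bestTranscription.formattedString)
        } else if let error = error {
            print("Recognition error: \(error.localizedDescription)")
        }
    }
}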
flowchart TD
    A[User Speaks] --> B{Microphone Input}
    B --> C[SFSpeechAudioBufferRecognitionRequest]
    C --> D[SFSpeechRecognizer]
    D --> E{Speech Recognition Service}
    E --> F[SFSpeechRecognitionTask]
    F --> G{"Recognition Result (SFSpeechRecognitionResult)"}
    G --> H[App Processes Text]
    H --> I[Display or Action]
Flow of speech recognition in an iOS application.
Setting Up Your Project and Requesting Permissions
Before you can use speech recognition, your application must request the user's permission to access the microphone and the speech recognition service. This involves adding specific keys to your Info.plist file and requesting authorization programmatically. If the usage-description keys are missing, your app will crash the moment it attempts to access the microphone or start recognition.
<key>NSSpeechRecognitionUsageDescription</key>
<string>Your app needs speech recognition to convert your voice to text.</string>
<key>NSMicrophoneUsageDescription</key>
<string>Your app needs microphone access to record your speech.</string>
import Speech

func requestSpeechAuthorization() {
    SFSpeechRecognizer.requestAuthorization { authStatus in
        // The callback may arrive on a background queue; hop to main before touching UI.
        OperationQueue.main.addOperation {
            switch authStatus {
            case .authorized:
                print("Speech recognition authorized")
            case .denied:
                print("Speech recognition denied")
            case .restricted:
                print("Speech recognition restricted on this device")
            case .notDetermined:
                print("Speech recognition not yet determined")
            @unknown default:
                fatalError("Unknown authorization status")
            }
        }
    }
}
Implementing Live Speech Recognition
To perform live speech recognition, you'll typically use an AVAudioEngine to capture audio from the microphone and feed it into an SFSpeechAudioBufferRecognitionRequest. The SFSpeechRecognizer then processes this audio and provides results through a delegate or a completion handler. Remember to stop the audio engine and cancel the recognition task when you're done or when the user stops speaking.
import Speech
import AVFoundation

class SpeechRecognizerManager: NSObject, SFSpeechRecognizerDelegate {
    private let speechRecognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))
    private var recognitionRequest: SFSpeechAudioBufferRecognitionRequest?
    private var recognitionTask: SFSpeechRecognitionTask?
    private let audioEngine = AVAudioEngine()

    var recognizedTextHandler: ((String) -> Void)?
    var isRecording = false

    override init() {
        super.init()
        speechRecognizer?.delegate = self
    }

    func startRecording() throws {
        // Guard against a second tap being installed on the input node.
        guard !isRecording else { return }
        guard let speechRecognizer = speechRecognizer, speechRecognizer.isAvailable else {
            throw SpeechRecognitionError.recognizerUnavailable
        }

        // Configure the audio session for recording before starting the engine.
        let audioSession = AVAudioSession.sharedInstance()
        try audioSession.setCategory(.record, mode: .measurement, options: .duckOthers)
        try audioSession.setActive(true, options: .notifyOthersOnDeactivation)

        recognitionRequest = SFSpeechAudioBufferRecognitionRequest()
        guard let recognitionRequest = recognitionRequest else { throw SpeechRecognitionError.requestFailed }
        recognitionRequest.shouldReportPartialResults = true

        // Feed microphone buffers into the recognition request.
        let inputNode = audioEngine.inputNode
        let recordingFormat = inputNode.outputFormat(forBus: 0)
        inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { [weak self] buffer, _ in
            self?.recognitionRequest?.append(buffer)
        }

        audioEngine.prepare()
        try audioEngine.start()

        recognitionTask = speechRecognizer.recognitionTask(with: recognitionRequest) { [weak self] result, error in
            guard let self = self else { return }
            if let result = result {
                self.recognizedTextHandler?(result.bestTranscription.formattedString)
            }
            // Tear down on error or when the recognizer reports a final result.
            if error != nil || result?.isFinal == true {
                self.stopRecording()
            }
        }
        isRecording = true
    }

    func stopRecording() {
        audioEngine.stop()
        audioEngine.inputNode.removeTap(onBus: 0)
        recognitionRequest?.endAudio()
        recognitionTask?.cancel()
        recognitionTask = nil
        recognitionRequest = nil
        isRecording = false
    }

    // MARK: - SFSpeechRecognizerDelegate
    func speechRecognizer(_ speechRecognizer: SFSpeechRecognizer, availabilityDidChange available: Bool) {
        if available {
            print("Speech recognizer is available.")
        } else {
            print("Speech recognizer is unavailable.")
        }
    }

    enum SpeechRecognitionError: Error {
        case recognizerUnavailable
        case requestFailed
    }
}
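The manager above hardcodes an en-US recognizer. Because the framework supports many languages and dialects, you may want to check what the device actually supports before creating one. A small sketch; the pickRecognizer helper is an assumption for illustration:

import Speech

// Hypothetical helper: prefer the user's current locale when the
// speech recognizer supports it, otherwise fall back to en-US.
func pickRecognizer() -> SFSpeechRecognizer? {
    let supported = SFSpeechRecognizer.supportedLocales()
    if supported.contains(Locale.current) {
        return SFSpeechRecognizer(locale: Locale.current)
    }
    return SFSpeechRecognizer(locale: Locale(identifier: "en-US"))
}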
Integrating into a UIViewController
To make this functional in your app, you'll typically have a UIViewController that manages the UI and interacts with the SpeechRecognizerManager. This involves setting up a button to start/stop recording and a label or text view to display the recognized text.
import UIKit
import Speech

class ViewController: UIViewController {
    let speechManager = SpeechRecognizerManager()
    let textView = UITextView()
    let recordButton = UIButton(type: .system)

    override func viewDidLoad() {
        super.viewDidLoad()
        setupUI()
        speechManager.recognizedTextHandler = { [weak self] text in
            self?.textView.text = text
        }
        requestSpeechAuthorization()
    }

    func setupUI() {
        textView.translatesAutoresizingMaskIntoConstraints = false
        textView.font = UIFont.systemFont(ofSize: 18)
        textView.layer.borderColor = UIColor.lightGray.cgColor
        textView.layer.borderWidth = 1.0
        textView.isEditable = false
        view.addSubview(textView)

        recordButton.translatesAutoresizingMaskIntoConstraints = false
        recordButton.setTitle("Start Recording", for: .normal)
        recordButton.isEnabled = false // Enabled once speech recognition is authorized.
        recordButton.addTarget(self, action: #selector(recordButtonTapped), for: .touchUpInside)
        view.addSubview(recordButton)

        NSLayoutConstraint.activate([
            textView.topAnchor.constraint(equalTo: view.safeAreaLayoutGuide.topAnchor, constant: 20),
            textView.leadingAnchor.constraint(equalTo: view.leadingAnchor, constant: 20),
            textView.trailingAnchor.constraint(equalTo: view.trailingAnchor, constant: -20),
            textView.heightAnchor.constraint(equalToConstant: 200),
            recordButton.topAnchor.constraint(equalTo: textView.bottomAnchor, constant: 20),
            recordButton.centerXAnchor.constraint(equalTo: view.centerXAnchor)
        ])
    }

    @objc func recordButtonTapped() {
        if speechManager.isRecording {
            speechManager.stopRecording()
            recordButton.setTitle("Start Recording", for: .normal)
        } else {
            do {
                try speechManager.startRecording()
                recordButton.setTitle("Stop Recording", for: .normal)
            } catch {
                print("Error starting recording: \(error.localizedDescription)")
                // Handle the error, e.g., show an alert.
            }
        }
    }

    func requestSpeechAuthorization() {
        SFSpeechRecognizer.requestAuthorization { authStatus in
            OperationQueue.main.addOperation {
                switch authStatus {
                case .authorized:
                    self.recordButton.isEnabled = true
                default:
                    self.recordButton.isEnabled = false
                    print("Speech recognition not authorized.")
                    // Inform the user about the permission issue.
                }
            }
        }
    }
}
The Speech framework sends results incrementally when shouldReportPartialResults is enabled. result.bestTranscription.formattedString gives you the best transcription so far, and you can use result.isFinal to determine when the recognition task has finished processing a segment of speech.
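For instance, inside the recognition task's callback you can branch on isFinal to treat partial and final transcriptions differently. A minimal sketch of just the callback, reusing the names from the manager class above:

recognitionTask = speechRecognizer.recognitionTask(with: recognitionRequest) { [weak self] result, error in
    guard let self = self, let result = result else { return }
    if result.isFinal {
        // The recognizer has finished this utterance; safe to persist the text.
        print("Final: \(result.bestTranscription.formattedString)")
        self.stopRecording()
    } else {
        // Partial hypothesis that may still change as more audio arrives.
        print("Partial: \(result.bestTranscription.formattedString)")
    }
}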