
Other articles (1)

  • Submit bugs and patches

    13 April 2011

    Unfortunately, software is never perfect.
    If you think you have found a bug, report it using our ticket system. Please help us fix it by providing the following information:
    • the browser you are using, including the exact version
    • as precise an explanation of the problem as possible
    • if possible, the steps taken that led to the problem
    • a link to the site / page in question
    If you think you have solved the bug, fill in a ticket and attach a corrective patch to it.
    You may also (...)

On other sites (235)

  • Is there a command-line tool or ffmpeg / sox command to generate speech labels, similar to Audacity's Sound Finder?

    7 September 2017, by TROUZINE Abderrezaq

    Is there a command-line tool or an ffmpeg / sox command to generate speech labels, similar to Audacity's Sound Finder?
    Only timeStart and timeEnd are needed in the output.
    Preferably, it should generate labels from a given timeStart to a given timeEnd.
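
    One possible starting point, offered as a sketch rather than anything from the post itself: ffmpeg's silencedetect filter prints silence boundaries on stderr, and speech intervals can be read off as each silence_end up to the next silence_start. The noise threshold and minimum silence duration below are assumptions to tune:

    ffmpeg -i input.wav -af silencedetect=noise=-30dB:d=0.5 -f null - 2>&1 | grep silence_

    Restricting the scan to a given timeStart / timeEnd can presumably be done with -ss and -t placed before the input.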

  • Recording streamed audio with ffmpeg for Cloud Speech-to-Text

    25 November 2019, by Ernesto Pariona Diaz

    Good evening,

    I am trying to record audio with the following settings:

    codec: flac
    sampling rate: 16000 Hz

    I am testing with the following command:

    ffmpeg -t 15 -i http://198.15.86.218:9436/stream -codec:a flac -b:a 16k example.flac

    But when reviewing the output file, I get the following:

    codec: flac
    sampling rate: 44000 Hz

    Could someone guide me on the correct use of the ffmpeg options?
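
    A likely fix, based on ffmpeg's documented options rather than on anything in the post: -b:a sets the audio bitrate, while the sample rate is set with -ar. Something like the following should yield 16 kHz FLAC (the added -ac 1 for mono is a further assumption, common for speech-to-text):

    ffmpeg -t 15 -i http://198.15.86.218:9436/stream -codec:a flac -ar 16000 -ac 1 example.flac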

  • Google Speech API + Go - Transcribing Audio Stream of Unknown Length

    14 February 2018, by Josh

    I have an RTMP stream of a video call and I want to transcribe it. I have created two services in Go, and I'm getting results, but they're not very accurate and a lot of data seems to get lost.

    Let me explain.

    I have a transcode service: I use ffmpeg to transcode the video to Linear16 audio and place the output bytes onto a PubSub queue for a transcribe service to handle. Obviously there is a limit to the size of a PubSub message, and I want to start transcribing before the end of the video call. So I chunk the transcoded data into roughly 3-second clips (not a fixed length, it just seems about right) and put them onto the queue.

    The data is transcoded quite simply:

    // bytes.Buffer (the original "Buffer" is not a defined type on its own).
    var stdout bytes.Buffer

    cmd := exec.Command("ffmpeg", "-i", url, "-f", "s16le", "-acodec", "pcm_s16le", "-ar", "16000", "-ac", "1", "-")
    cmd.Stdout = &stdout

    if err := cmd.Start(); err != nil {
       log.Fatal(err)
    }

    // Every 3 seconds, publish whatever ffmpeg has produced so far, then reset.
    // (Note: ffmpeg writes to stdout concurrently, so real code would need to
    // guard the buffer with a mutex, or read from a pipe instead.)
    ticker := time.NewTicker(3 * time.Second)

    for range ticker.C {
       bytesConverted := stdout.Len()
       log.Infof("Converted %d bytes", bytesConverted)

       // Send the data we converted, even if there are no bytes.
       topic.Publish(ctx, &pubsub.Message{
           Data: stdout.Bytes(),
       })

       stdout.Reset()
    }

    The transcribe service pulls messages from the queue at a rate of one every 3 seconds, which processes the audio data at about the same rate as it is being created. There is a limit on a Speech API stream: it can't be longer than 60 seconds. So I stop the old stream and start a new one every 30 seconds, and we never hit the limit no matter how long the video call lasts.

    This is how I'm transcribing it:

    stream := prepareNewStream()
    clipLengthTicker := time.NewTicker(30 * time.Second)
    chunkLengthTicker := time.NewTicker(3 * time.Second)

    cctx, cancel := context.WithCancel(context.TODO())
    defer cancel() // cancel was declared but unused in the original, which does not compile

    err := subscription.Receive(cctx, func(ctx context.Context, msg *pubsub.Message) {

       select {
       case <-clipLengthTicker.C:
           log.Infof("Clip length reached.")
           log.Infof("Closing stream and starting over")

           err := stream.CloseSend()
           if err != nil {
               log.Fatalf("Could not close stream: %v", err)
           }

           // Collect results from the finished stream in the background.
           go getResult(stream)
           stream = prepareNewStream()

       case <-chunkLengthTicker.C:
           log.Infof("Chunk length reached.")

           bytesConverted := len(msg.Data)

           log.Infof("Received %d bytes\n", bytesConverted)

           if bytesConverted > 0 {
               if err := stream.Send(&speechpb.StreamingRecognizeRequest{
                   StreamingRequest: &speechpb.StreamingRecognizeRequest_AudioContent{
                       // The original referenced an undefined transcodedChunk;
                       // the chunk's audio lives in msg.Data.
                       AudioContent: msg.Data,
                   },
               }); err != nil {
                   resp, _ := stream.Recv()
                   log.Errorf("Could not send audio: %v", resp.GetError())
               }
           }

           msg.Ack()
       }
    })

    I think the problem is that my 3-second chunks don't necessarily line up with the starts and ends of phrases or sentences. I suspect the Speech API is a recurrent neural network that has been trained on full sentences rather than individual words, so starting a clip in the middle of a sentence loses some data: it can't figure out the first few words up to the natural end of a phrase. I also lose some data when changing from an old stream to a new one, since some context is lost. I guess overlapping clips might help with this (a rough sketch follows).
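
    As a rough illustration of the overlap idea, and not something from the post itself: the publisher could keep the tail of each chunk and prepend it to the next one, so that audio straddling a chunk boundary is sent twice. The chunks channel, the publish helper, and the half-second overlap are all assumptions:

    // Hypothetical sketch: prepend the tail of the previous chunk to the next
    // one so phrase boundaries straddling a chunk edge are heard twice.
    const overlapBytes = 16000 * 2 / 2 // 0.5 s of 16 kHz, 16-bit mono PCM

    var carry []byte
    for chunk := range chunks { // chunks: a <-chan []byte fed by the transcoder (assumed)
       data := append(append([]byte(nil), carry...), chunk...)
       publish(data) // stand-in for topic.Publish (assumed)

       if len(chunk) >= overlapBytes {
           carry = append([]byte(nil), chunk[len(chunk)-overlapBytes:]...)
       } else {
           carry = append([]byte(nil), chunk...)
       }
    }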

    I have a couple of questions:

    1) Does this architecture seem appropriate for my constraints (unknown length of audio stream, etc.)?

    2) What can I do to improve accuracy and minimise lost data?

    (Note: I've simplified the examples for readability. Please point out if anything doesn't make sense, as I may have been heavy-handed in cutting the examples down.)