
Other articles (87)
-
Organising by category
17 May 2013
In MédiaSPIP, a section (rubrique) goes by two names: category and rubrique.
The various documents stored in MédiaSPIP can be filed under different categories. A category can be created by clicking "publier une catégorie" in the publish menu at the top right (after authentication). A category can also be filed inside another category, which means you can build a whole tree of categories.
The next time a document is published, the newly created category will be offered (...)
-
Retrieving information from the master site when installing an instance
26 November 2010
Purpose
On the main site, a shared-hosting (mutualisation) instance is defined by several things: the data in the spip_mutus table; its logo; its main author (the id_admin in the spip_mutus table, corresponding to an id_auteur in the spip_auteurs table), who will be the only one able to finalise the creation of the instance;
It can therefore make perfect sense to want to retrieve some of this information in order to complete the installation of an instance, for example to retrieve the (...)
-
Sites built with MediaSPIP
2 May 2011
This page presents some of the sites running MediaSPIP.
You can of course add your own using the form at the bottom of the page.
On other sites (8699)
-
Method For Crawling Google
28 May 2011, by Multimedia Mike — Big Data
I wanted to crawl Google in order to harvest a large corpus of certain types of data as yielded by a certain search term (we’ll call it “term” for this exercise). Google doesn’t appear to offer any API to automatically harvest their search results (why would they?). So I sat down and thought about how to do it. This is the solution I came up with.
FAQ
Q: Is this legal / ethical / compliant with Google’s terms of service?
A: Does it look like I care? Moving right along…
Manual Crawling Process
For this exercise, I essentially automated the task that would be performed by a human. It goes something like this:
- Search for “term”
- On the first page of results, download each of the 10 results returned
- Click on the next page of results
- Go to step 2, until Google doesn’t return any more pages of search results
Google returns up to 1000 results for a given search term. Fetching them 10 at a time is less than efficient. Fortunately, the search URL can easily be tweaked to return up to 100 results per page.
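As a rough illustration, the paging can be expressed as a small URL builder; the q, num and start parameter names follow Google’s classic search URL format and are an assumption here, not something spelled out in the post.

# Hypothetical sketch: build a search URL that asks for 100 results per
# page. The "q", "num" and "start" parameter names follow Google's
# classic search URL format and are an assumption.
from urllib.parse import urlencode

def search_url(term, start=0, per_page=100):
    params = {"q": term, "num": per_page, "start": start}
    return "https://www.google.com/search?" + urlencode(params)

# Results 1-100, then 101-200, and so on.
print(search_url("term", start=0))
print(search_url("term", start=100))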
Expanding Reach
Problem: 1000 results for the “term” search isn’t that many. I need a way to expand the search. I’m not aiming for relevancy; I’m just searching for random examples of some data that occurs around the internet.
My solution for this is to refine the search using the “site” wildcard. For example, you can ask Google to search for “term” at all Canadian domains using “site:.ca”. So, the manual process now involves harvesting up to 1000 results for every single internet top level domain (TLD). But many TLDs can be more granular than that. For example, there are 50 sub-domains under .us, one for each state (e.g., .ca.us, .ny.us). Those all need to be searched independently. Same for all the sub-domains under TLDs which don’t allow domains under the main TLD, such as .uk (search under .co.uk, .ac.uk, etc.).
Another extension is to combine “term” searches with other terms that are likely to have a rich correlation with “term”. For example, if “term” is relevant to various scientific fields, search for “term” in conjunction with various scientific disciplines.
Algorithmically
My solution is to create an SQLite database that contains a table of search seeds. Each seed is essentially a “site:” string combined with a starting index.
Each TLD and sub-TLD is inserted as a searchseed record with a starting index of 0.
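A minimal sketch of that seed table, using Python’s sqlite3 module; the column names, the url table layout and the short TLD list are hypothetical, chosen only to mirror the structure described above.

import sqlite3

conn = sqlite3.connect("crawl.db")

# One row per "site:" seed plus a starting result index and a crawled flag.
conn.execute("""
    CREATE TABLE IF NOT EXISTS searchseed (
        id      INTEGER PRIMARY KEY,
        site    TEXT NOT NULL,       -- e.g. "site:.ca"
        start   INTEGER NOT NULL,    -- starting result index for this seed
        crawled INTEGER NOT NULL DEFAULT 0
    )
""")

# Scraped result URLs end up here (the post's "URL table").
conn.execute("""
    CREATE TABLE IF NOT EXISTS url (
        id         INTEGER PRIMARY KEY,
        url        TEXT UNIQUE,
        downloaded INTEGER NOT NULL DEFAULT 0,
        stored_at  TEXT
    )
""")

# Every TLD and sub-TLD goes in with a starting index of 0.
tlds = [".ca", ".us", ".ca.us", ".ny.us", ".co.uk", ".ac.uk"]  # sample list
conn.executemany(
    "INSERT INTO searchseed (site, start) VALUES (?, 0)",
    [("site:" + tld,) for tld in tlds],
)
conn.commit()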
A script performs the following crawling algorithm (a sketch in code follows the list):
- Fetch the next record from the searchseed table which has not been crawled
- Fetch search result page from Google
- Scrape URLs from page and insert each into URL table
- Mark the searchseed record as having been crawled
- If the results page indicates there are more results for this search, insert a new searchseed for the same seed but with a starting index 100 higher
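Read as code, the loop above might look like the following sketch; fetch_results() is a hypothetical stand-in for the page-fetching and scraping step, returning the scraped URLs plus a flag saying whether more results are available.

# Sketch of the crawl loop from the list above, against the hypothetical
# schema sketched earlier. fetch_results(site, start) is assumed to fetch
# one results page and return (list_of_urls, more_results_available).
def crawl_once(conn, fetch_results):
    row = conn.execute(
        "SELECT id, site, start FROM searchseed WHERE crawled = 0 LIMIT 1"
    ).fetchone()
    if row is None:
        return False                      # nothing left to crawl
    seed_id, site, start = row

    urls, has_more = fetch_results(site, start)
    conn.executemany("INSERT OR IGNORE INTO url (url) VALUES (?)",
                     [(u,) for u in urls])
    conn.execute("UPDATE searchseed SET crawled = 1 WHERE id = ?", (seed_id,))
    if has_more:                          # queue the next page of 100 results
        conn.execute("INSERT INTO searchseed (site, start) VALUES (?, ?)",
                     (site, start + 100))
    conn.commit()
    return True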
Digging Into Sites
Sometimes, Google notes that certain sites are particularly rich sources of “term” and offers to let you search that site for “term”. This basically links to another search for “term site:somesite”. That site gets its own search seed and the program might harvest up to 1000 URLs from that site alone.
Harvesting the Data
Armed with a database of URLs, employ the following algorithm:
- Fetch a random URL from the database which has yet to be downloaded
- Try to download it
- For goodness sake, have a mechanism in place to detect whether the download process has stalled and automatically kill it after a certain period of time
- Store the data and update the database, noting where the information was stored and that it is already downloaded
This step is easy to parallelize by simply executing multiple copies of the script. It is useful to update the URL table to indicate that one process is already trying to download a URL so multiple processes don’t duplicate work (a sketch of this step follows).
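A sketch of that download step, again against the hypothetical schema above; the stall guard is a per-read socket timeout plus an overall wall-clock deadline, and the deadline, chunk size and file naming are arbitrary example values.

import time
from urllib.request import Request, urlopen

def download_one(conn, data_dir, deadline=120):
    # Claim a random not-yet-downloaded URL so parallel workers don't collide.
    row = conn.execute(
        "SELECT id, url FROM url WHERE downloaded = 0 "
        "ORDER BY RANDOM() LIMIT 1").fetchone()
    if row is None:
        return False
    url_id, url = row
    conn.execute("UPDATE url SET downloaded = -1 WHERE id = ?", (url_id,))  # -1 = in progress
    conn.commit()

    chunks, started = [], time.time()
    try:
        resp = urlopen(Request(url), timeout=30)   # guards a single stalled read
        while True:
            if time.time() - started > deadline:   # guards a slow, dribbling download
                raise TimeoutError("download exceeded deadline")
            chunk = resp.read(64 * 1024)
            if not chunk:
                break
            chunks.append(chunk)
    except Exception:
        # Give the URL back so another attempt (or another worker) can retry it.
        conn.execute("UPDATE url SET downloaded = 0 WHERE id = ?", (url_id,))
        conn.commit()
        return True

    path = "%s/%d.dat" % (data_dir, url_id)
    with open(path, "wb") as f:
        f.write(b"".join(chunks))
    conn.execute("UPDATE url SET downloaded = 1, stored_at = ? WHERE id = ?",
                 (path, url_id))
    conn.commit()
    return True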
Acting Human
A few factors here:
- Google allegedly doesn’t like automated programs crawling its search results. Thus, at the very least, don’t let your script advertise itself as an automated program. At a basic level, this means forging the User-Agent: HTTP header. By default, Python’s urllib2 identifies itself with a Python-urllib User-Agent string; change this to a well-known browser string (see the sketch after this list).
- Be patient; don’t fire off these search requests as quickly as possible. My crawling algorithm inserts a random delay of a few seconds in between each request. This can still yield hundreds of useful URLs per minute.
- On harvesting the data: Even though you can parallelize this and download data as quickly as your connection can handle, it’s a good idea to randomize the URLs. If you hypothetically had 4 download processes running at once and they got to a point in the URL table which had many URLs from a single site, the server might be configured to reject too many simultaneous requests from a single client.
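The post names Python 2’s urllib2; in Python 3 the equivalent module is urllib.request, and overriding the default User-Agent while pacing requests might look like the sketch below. The browser string is just an example value.

import random
import time
from urllib.request import Request, urlopen

# Example browser-like User-Agent string; substitute any current one.
BROWSER_UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
              "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36")

def polite_fetch(url):
    # Send a browser-like User-Agent instead of the default Python-urllib one.
    req = Request(url, headers={"User-Agent": BROWSER_UA})
    html = urlopen(req, timeout=30).read()
    # Random delay of a few seconds between requests, as the post suggests.
    time.sleep(random.uniform(2, 5))
    return html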
Conclusion
Anyway, that’s just the way I would (and did) do it. What did I do with all the data? That’s a subject for a different post.
-
avcodec/vc1: store color-difference reference field type
23 April 2018, by Jerome Borsboom
avcodec/vc1: store color-difference reference field type
The loop filter for P interlace field pictures needs the reference field type.
For luma, the reference field type was already available. Store the reference
field type for color-difference as well.
Signed-off-by: Jerome Borsboom <jerome.borsboom@carpalis.nl>
-
Google Speech - Streaming Request Returns EOF Error
16 October 2017, by Josh
Using Go, I’m taking an RTMP stream, transcoding it to FLAC (using ffmpeg) and attempting to stream it to Google’s Speech API to transcribe the audio. However, I keep getting EOF errors when sending the data. I can’t find any information on this error in the docs, so I’m not exactly sure what’s causing it.
I’m chunking the received data into 3-second clips (the length isn’t relevant as long as it’s less than the maximum length of a streaming recognition request).
Here is the core of my code:
func main() {
    done := make(chan os.Signal)
    received := make(chan []byte)
    go receive(received)
    go transcribe(received)
    signal.Notify(done, os.Interrupt, syscall.SIGTERM)
    select {
    case <-done:
        os.Exit(0)
    }
}

func receive(received chan<- []byte) {
    var b bytes.Buffer
    stdout := bufio.NewWriter(&b)
    cmd := exec.Command("ffmpeg", "-i", "rtmp://127.0.0.1:1935/live/key", "-f", "flac", "-ar", "16000", "-")
    cmd.Stdout = stdout
    if err := cmd.Start(); err != nil {
        log.Fatal(err)
    }
    duration, _ := time.ParseDuration("3s")
    ticker := time.NewTicker(duration)
    for {
        select {
        case <-ticker.C:
            stdout.Flush()
            log.Printf("Received %d bytes", b.Len())
            received <- b.Bytes()
            b.Reset()
        }
    }
}

func transcribe(received <-chan []byte) {
    ctx := context.TODO()
    client, err := speech.NewClient(ctx)
    if err != nil {
        log.Fatal(err)
    }
    stream, err := client.StreamingRecognize(ctx)
    if err != nil {
        log.Fatal(err)
    }
    // Send the initial configuration message.
    if err = stream.Send(&speechpb.StreamingRecognizeRequest{
        StreamingRequest: &speechpb.StreamingRecognizeRequest_StreamingConfig{
            StreamingConfig: &speechpb.StreamingRecognitionConfig{
                Config: &speechpb.RecognitionConfig{
                    Encoding:        speechpb.RecognitionConfig_FLAC,
                    LanguageCode:    "en-GB",
                    SampleRateHertz: 16000,
                },
            },
        },
    }); err != nil {
        log.Fatal(err)
    }
    for {
        select {
        case data := <-received:
            if len(data) > 0 {
                log.Printf("Sending %d bytes", len(data))
                if err := stream.Send(&speechpb.StreamingRecognizeRequest{
                    StreamingRequest: &speechpb.StreamingRecognizeRequest_AudioContent{
                        AudioContent: data,
                    },
                }); err != nil {
                    log.Printf("Could not send audio: %v", err)
                }
            }
        }
    }
}

Running this code gives this output:
2017/10/09 16:05:00 Received 191704 bytes
2017/10/09 16:05:00 Saving 191704 bytes
2017/10/09 16:05:00 Sending 191704 bytes
2017/10/09 16:05:00 Could not send audio: EOF
2017/10/09 16:05:03 Received 193192 bytes
2017/10/09 16:05:03 Saving 193192 bytes
2017/10/09 16:05:03 Sending 193192 bytes
2017/10/09 16:05:03 Could not send audio: EOF
2017/10/09 16:05:06 Received 193188 bytes
2017/10/09 16:05:06 Saving 193188 bytes
2017/10/09 16:05:06 Sending 193188 bytes // Notice that this doesn't error
2017/10/09 16:05:09 Received 191704 bytes
2017/10/09 16:05:09 Saving 191704 bytes
2017/10/09 16:05:09 Sending 191704 bytes
2017/10/09 16:05:09 Could not send audio: EOF
Notice that not all of the Send calls fail.
Could anyone point me in the right direction here? Is it something to do with the FLAC headers or something? I also wonder if maybe resetting the buffer causes some of the data to be dropped (i.e. it’s a non-trivial operation that actually takes some time to complete) and it doesn’t like this missing information?
Any help would be really appreciated.