Advanced search

Media (0)


No media matching your criteria is available on the site.

Other articles (22)

  • The SPIPmotion queue

    28 November 2010, by

    A queue stored in the database
    During installation, SPIPmotion creates a new table in the database, named spip_spipmotion_attentes.
    This new table is made up of the following fields: id_spipmotion_attente, the unique numeric identifier of the task to be processed; id_document, the numeric identifier of the original document to encode; id_objet, the unique identifier of the object to which the encoded document should automatically be attached; objet, the type of object to which (...)

  • Websites made with MediaSPIP

    2 May 2011, by

    This page lists some websites based on MediaSPIP.

  • Creating farms of unique websites

    13 April 2011, by

    MediaSPIP platforms can be installed as a farm, with a single "core" hosted on a dedicated server and used by multiple websites.
    This allows (among other things): implementation costs to be shared between several different projects/individuals; rapid deployment of multiple unique sites; creation of groups of like-minded sites, making it possible to browse media in a more controlled and selective environment than the major "open" (...)

On other sites (3663)

  • What’s So Hard About Building?

    10 September 2011, by Multimedia Mike — Programming

    I finally had a revelation as to why building software can be so difficult: because build systems are typically built on programming languages that you don’t normally use in your day-to-day programming activities. If the project is simple enough, the build system usually takes care of the complexities. If there are subtle complexities — and there always are — then you have to figure out how to customize the build system to meet your needs.

    First, there’s the Makefile. It’s easy to forget that the syntax which comprises a Makefile pretty well qualifies as a programming language. I wonder if it’s Turing-complete? But writing and maintaining Makefiles manually is arduous and many systems have been created to generate Makefiles for you. At the end of the day, running ‘make’ still requires the presence of a Makefile and in the worst case scenario, you’re going to have to inspect and debug what was automatically generated for that Makefile.

    So there is the widespread GNU build system, a.k.a. “the autotools”, named for its principal components such as autoconf and automake. In this situation, you have no fewer than 3 distinct languages at work. You write your general build instructions using a set of m4 macros (language #1). These get processed by the autotools in order to generate a shell script (language #2) called configure. When this is executed by the user, it eventually generates a Makefile (language #3).

    Over the years, a few challengers have attempted to dethrone autotools. One is CMake which configures a project using its own custom programming language that you will need to learn. Configuration generates a standard Makefile. So there are 2 languages involved in this approach.

    Another option is SCons, which is Python-based, top to bottom. Only one programming language is involved in the build system; there’s no Makefile generated and run. Until I started writing this, I was guessing that the Python component generated a Makefile, but no.

    That actually makes SCons look fairly desirable, at least if your only metric when choosing a build system is to minimize friction against rarely-used programming languages.

    I should also make mention of a few others: Apache Ant is a build system in which the build process is described by an XML file. XML doesn’t qualify as a programming language (though that apparently doesn’t stop some people from using it as such). I see there’s also qmake, related to the Qt system. This system uses its own custom syntax.

  • Method For Crawling Google

    28 May 2011, by Multimedia Mike — Big Data

    I wanted to crawl Google in order to harvest a large corpus of certain types of data as yielded by a certain search term (we’ll call it “term” for this exercise). Google doesn’t appear to offer any API to automatically harvest their search results (why would they?). So I sat down and thought about how to do it. This is the solution I came up with.



    FAQ
    Q: Is this legal/ethical/compliant with Google’s terms of service?
    A: Does it look like I care? Moving right along…

    Manual Crawling Process
    For this exercise, I essentially automated the task that would be performed by a human. It goes something like this:

    1. Search for “term”
    2. On the first page of results, download each of the 10 results returned
    3. Click on the next page of results
    4. Go to step 2, until Google doesn’t return any more pages of search results

    Google returns up to 1000 results for a given search term. Fetching them 10 at a time is less than efficient. Fortunately, the search URL can easily be tweaked to return up to 100 results per page.
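    As an illustration only, here is roughly how that URL tweak could be expressed in Python 2; the post doesn’t give the actual URL, and the q, num and start parameter names are my recollection of Google’s classic search URL rather than anything stated here:

    import urllib

    def search_url(term, start=0, per_page=100):
        # Ask for a full page of results: 'num' is results per page, 'start' is the offset.
        params = urllib.urlencode({'q': term, 'num': per_page, 'start': start})
        return 'http://www.google.com/search?' + params

    # offsets 0, 100, 200, ... up to the ~1000-result ceiling
    urls = [search_url('term', start) for start in range(0, 1000, 100)]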

    Expanding Reach
    Problem: 1000 results for the “term” search isn’t that many. I need a way to expand the search. I’m not aiming for relevancy; I’m just searching for random examples of some data that occurs around the internet.

    My solution for this is to refine the search using the “site” wildcard. For example, you can ask Google to search for “term” at all Canadian domains using “site:.ca”. So, the manual process now involves harvesting up to 1000 results for every single internet top level domain (TLD). But many TLDs can be more granular than that. For example, there are 50 sub-domains under .us, one for each state (e.g., .ca.us, .ny.us). Those all need to be searched independently. Same for all the sub-domains under TLDs which don’t allow domains under the main TLD, such as .uk (search under .co.uk, .ac.uk, etc.).

    Another extension is to combine “term” searches with other terms that are likely to have a rich correlation with “term”. For example, if “term” is relevant to various scientific fields, search for “term” in conjunction with various scientific disciplines.

    Algorithmically
    My solution is to create an SQLite database that contains a table of search seeds. Each seed is essentially a “site:” string combined with a starting index.

    Each TLD and sub-TLD is inserted as a searchseed record with a starting index of 0.
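    To make that concrete, here is a minimal sqlite3 sketch of the seed table (plus the URL table used later); the table and column names are my own guesses, since the post only says that each seed pairs a “site:” string with a starting index:

    import sqlite3

    conn = sqlite3.connect('crawl.db')
    conn.execute("""CREATE TABLE IF NOT EXISTS searchseed (
                        site    TEXT,               -- e.g. 'site:.ca' or 'site:.co.uk'
                        start   INTEGER DEFAULT 0,  -- result offset to request next
                        crawled INTEGER DEFAULT 0   -- 0 = not yet fetched
                    )""")
    conn.execute("""CREATE TABLE IF NOT EXISTS url (
                        url        TEXT,
                        downloaded INTEGER DEFAULT 0,
                        path       TEXT
                    )""")

    # every TLD and sub-TLD becomes a seed with a starting index of 0
    for tld in ['.ca', '.us', '.ca.us', '.ny.us', '.co.uk', '.ac.uk']:   # abbreviated list
        conn.execute("INSERT INTO searchseed (site, start) VALUES (?, 0)", ('site:' + tld,))
    conn.commit()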

    A script performs the following crawling algorithm (a rough Python sketch follows the list):

    • Fetch the next record from the searchseed table which has not been crawled
    • Fetch search result page from Google
    • Scrape URLs from page and insert each into URL table
    • Mark the searchseed record as having been crawled
    • If the results page indicates there are more results for this search, insert a new searchseed for the same seed but with a starting index 100 higher
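    Leaning on the hypothetical search_url() helper and the table layout sketched above, the loop might look something like this; the two scraping helpers are crude placeholders, since the post doesn’t describe how the result pages are actually parsed:

    import re, time, random, urllib2

    def scrape_result_urls(page):
        # placeholder scraper: the real result-link markup would need inspecting
        return re.findall(r'href="(http[^"]+)"', page)

    def has_more_results(page):
        # placeholder check for a "next page of results" indicator
        return 'Next' in page

    def crawl(conn):
        while True:
            row = conn.execute("SELECT rowid, site, start FROM searchseed "
                               "WHERE crawled = 0 LIMIT 1").fetchone()
            if row is None:
                break
            rowid, site, start = row
            page = urllib2.urlopen(search_url('term %s' % site, start)).read()
            for u in scrape_result_urls(page):
                conn.execute("INSERT INTO url (url) VALUES (?)", (u,))
            conn.execute("UPDATE searchseed SET crawled = 1 WHERE rowid = ?", (rowid,))
            if has_more_results(page):
                # same seed again, 100 results further along
                conn.execute("INSERT INTO searchseed (site, start) VALUES (?, ?)",
                             (site, start + 100))
            conn.commit()
            time.sleep(random.uniform(2, 5))   # random delay; see "Acting Human" below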

    Digging Into Sites
    Sometimes, Google notes that certain sites are particularly rich sources of “term” and offers to let you search that site for “term”. This basically links to another search for “term site:somesite”. That site gets its own search seed and the program might harvest up to 1000 URLs from that site alone.

    Harvesting the Data
    Armed with a database of URLs, employ the following algorithm (again, sketched in code after the list):

    • Fetch a random URL from the database which has yet to be downloaded
    • Try to download it
    • For goodness sake, have a mechanism in place to detect whether the download process has stalled and automatically kill it after a certain period of time
    • Store the data and update the database, noting where the information was stored and that it is already downloaded
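    Under the same assumed url table, one hedged way to write that download step in Python 2 is below; the 60-second timeout stands in for the “detect a stalled download and kill it” requirement:

    import os, socket, urllib2

    def harvest_one(conn, store_dir='data'):
        if not os.path.isdir(store_dir):
            os.makedirs(store_dir)
        row = conn.execute("SELECT rowid, url FROM url "
                           "WHERE downloaded = 0 ORDER BY RANDOM() LIMIT 1").fetchone()
        if row is None:
            return False                     # nothing left to download
        rowid, url = row
        # (for parallel workers, you would also mark the row as claimed here
        #  so that multiple processes don't duplicate work)
        try:
            data = urllib2.urlopen(url, timeout=60).read()
        except (urllib2.URLError, socket.timeout):
            data = None                      # stalled or failed; move on
        path = None
        if data is not None:
            path = os.path.join(store_dir, '%d.bin' % rowid)
            open(path, 'wb').write(data)
        conn.execute("UPDATE url SET downloaded = 1, path = ? WHERE rowid = ?",
                     (path, rowid))
        conn.commit()
        return True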

    This step is easy to parallelize by simply executing multiple copies of the script. It is useful to update the URL table to indicate that one process is already trying to download a URL so multiple processes don’t duplicate work.

    Acting Human
    A few factors here (a short sketch follows the list):

    • Google allegedly doesn’t like automated programs crawling its search results. Thus, at the very least, don’t let your script advertise itself as an automated program. At a basic level, this means forging the User-Agent: HTTP header. By default, Python’s urllib2 will identify itself as a programming language. Change this to a well-known browser string.
    • Be patient ; don’t fire off these search requests as quickly as possible. My crawling algorithm inserts a random delay of a few seconds in between each request. This can still yield hundreds of useful URLs per minute.
    • On harvesting the data: Even though you can parallelize this and download data as quickly as your connection can handle, it’s a good idea to randomize the URLs. If you hypothetically had 4 download processes running at once and they got to a point in the URL table which had many URLs from a single site, the server might be configured to reject too many simultaneous requests from a single client.
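    Pulling those three points together, a single fetch helper might look like this (the browser string is just an example, not the one the author used):

    import time, random, urllib2

    def polite_fetch(url):
        # don't advertise as a bot: replace urllib2's default "Python-urllib/2.x" User-Agent
        req = urllib2.Request(url, headers={
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko)'})
        time.sleep(random.uniform(2, 6))     # be patient: a few seconds between requests
        return urllib2.urlopen(req).read()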

    Conclusion
    Anyway, that’s just the way I would (and did) do it. What did I do with all the data? That’s a subject for a different post.


  • Adventures in Unicode

    29 November 2012, by Multimedia Mike — Programming, php, Python, sqlite3, unicode

    Tangential to multimedia hacking is proper metadata handling. Recently, I have gathered an interest in processing a large corpus of multimedia files which are likely to contain metadata strings which do not fall into the lower ASCII set. This is significant because the lower ASCII set intersects perfectly with my own programming comfort zone. Indeed, all of my programming life, I have insisted on covering my ears and loudly asserting “LA LA LA LA LA! ALL TEXT EVERYWHERE IS ASCII!” I suspect I’m not alone in this.

    Thus, I took this as an opportunity to conquer my longstanding fear of Unicode. I developed a self-learning course comprised of a series of exercises which add up to this diagram:



    Part 1: Understanding Text Encoding
    Python has regular strings by default and then it has Unicode strings. The latter are prefixed by the letter ‘u’. This is what ‘ö’ looks like encoded in each type.

    >>> 'ö', u'ö'
    ('\xc3\xb6', u'\xf6')

    A large part of my frustration with Unicode comes from Python yelling at me about UnicodeDecodeErrors and an inability to handle the number 0xc3 for some reason. This usually comes when I’m trying to wrap my head around an unrelated problem and don’t care to get sidetracked by text encoding issues. However, when I studied the above output, I finally understood where the 0xc3 comes from. I just didn’t understand what the encoding represents exactly.

    I can see from assorted tables that ‘ö’ is character 0xF6 in various encodings (in Unicode and Latin-1), so u'\xf6' makes sense. But what does '\xc3\xb6' mean? It’s my style to excavate straight down to the lowest levels, and I wanted to understand exactly how characters are represented in memory. The UTF-8 encoding tables inform us that any Unicode code point above 0x7F but less than 0x800 will be encoded with 2 bytes:

     110xxxxx 10xxxxxx
    

    Applying this pattern to the \xc3\xb6 encoding:

                hex : 0xc3      0xb6
               bits : 11000011  10110110
     important bits : ---00011  --110110
          assembled : 00011110110
         code point : 0xf6
    

    I was elated when I drew that out and made the connection. Maybe I’m the last programmer to figure this stuff out. But I’m still happy that I actually understand those Python errors pertaining to the number 0xc3 and that I won’t have to apply canned solutions without understanding the core problem.
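    Just to double-check that arithmetic, the same bit-twiddling can be written out at the Python 2 REPL (my own sketch, following the 110xxxxx 10xxxxxx pattern above):

    >>> hex(((0xc3 & 0x1f) << 6) | (0xb6 & 0x3f))   # keep the payload bits and reassemble
    '0xf6'
    >>> '\xc3\xb6'.decode('utf-8')                  # the codec agrees
    u'\xf6'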

    I’m cheating on this part of this exercise just a little bit since the diagram implied that the Unicode text needs to come from a binary file. I’ll return to that in a bit. For now, I’ll just contrive the following Unicode string from the Python REPL:

    >>> u = u'Üñìçôđé'
    >>> u
    u'\xdc\xf1\xec\xe7\xf4\u0111\xe9'

    Part 2: From Python To SQLite3
    The next step is to see what happens when I use Python’s SQLite3 module to dump the string into a new database. Will the Unicode encoding be preserved on disk? What will UTF-8 look like on disk anyway?

    >>> import sqlite3
    >>> conn = sqlite3.connect('unicode.db')
    >>> conn.execute("CREATE TABLE t (t text)")
    >>> conn.execute("INSERT INTO t VALUES (?)", (u, ))
    >>> conn.commit()
    >>> conn.close()

    Next, I manually view the resulting database file (unicode.db) using a hex editor and look for strings. Here we go:

    000007F0   02 29 C3 9C  C3 B1 C3 AC  C3 A7 C3 B4  C4 91 C3 A9
    

    Look at that! It’s just like the \xc3\xb6 encoding we see in the regular Python strings.

    Part 3: From SQLite3 To A Web Page Via PHP
    Finally, use PHP (love it or hate it, but it’s what’s most convenient on my hosting provider) to query the string from the database and display it on a web page, completing the outlined processing pipeline.

    <?php
    $dbh = new PDO("sqlite:unicode.db");
    foreach ($dbh->query("SELECT t from t") as $row)
        $unicode_string = $row['t'];
    ?>

    <html>
    <head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head>
    <body><h1><?= $unicode_string ?></h1></body>
    </html>

    I tested the foregoing PHP script on 3 separate browsers that I had handy (Firefox, Internet Explorer, and Chrome):



    I’d say that counts as success! It’s important to note that the “meta http-equiv” tag is absolutely necessary. Omit it and see something like this:



    Since we know what the UTF-8 stream looks like, it’s pretty obvious how the mapping is operating here: 0xc3 and 0xc4 correspond to ‘Ã’ and ‘Ä’, respectively. This corresponds to an encoding named ISO/IEC 8859-1, a.k.a. Latin-1. Speaking of which…
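    That byte-for-byte substitution is easy to reproduce at the REPL by decoding the UTF-8 bytes as Latin-1 instead (my own illustration, not from the post):

    >>> u'\xf6'.encode('utf-8')
    '\xc3\xb6'
    >>> print u'\xf6'.encode('utf-8').decode('latin-1')
    Ã¶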

    Part 4: Converting Binary Data To Unicode
    At the start of the experiment, I was trying to extract metadata strings from these binary multimedia files and I noticed characters like our friend ‘ö’ from above. In the bytestream, this was represented simply with 0xf6. I mistakenly believed that this was the on-disk representation of UTF-8. Wrong. Turns out it’s Latin-1.

    However, I still need to solve the problem of transforming such strings into Unicode to be shoved through the pipeline diagrammed above. For this experiment, I created a 9-byte file with the Latin-1 string ‘Üñìçôdé’ couched by 0s, to simulate yanking a string out of a binary file. Here’s unicode.file:

    00000000   00 DC F1 EC  E7 F4 64 E9  00         ......d..
    

    (Aside: this experiment uses plain ‘d’ since the ‘đ’ with a bar through it doesn’t occur in Latin-1; it shows up all over the place in Vietnamese, at least.)

    I’ve been mashing around Python code via the REPL, trying to get this string into a Unicode-friendly format. This is a successful method but it’s probably not the best:

    >>> import struct
    >>> f = open('unicode.file', 'r').read()
    >>> u = u''
    >>> for c in struct.unpack("B"*7, f[1:8]):
    ...     u += unichr(c)
    ...
    >>> u
    u'\xdc\xf1\xec\xe7\xf4d\xe9'
    >>> print u
    Üñìçôdé
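    For what it’s worth, since the bytes are known to be Latin-1, the codec machinery can do the same job in one line; this is just a simpler alternative, not the author’s method:

    >>> open('unicode.file', 'r').read()[1:8].decode('latin-1')
    u'\xdc\xf1\xec\xe7\xf4d\xe9'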

    Conclusion
    Dealing with text encoding matters reminds me of dealing with integer endian-ness concerns. When you’re just dealing with one system, you probably don’t need to think too much about it because the system is usually handling everything consistently underneath the covers.

    However, when the data leaves one system and will be interpreted by another system, that’s when a programmer needs to be cognizant of matters such as integer endianness or text encoding.