BBC Datasets Exploratory Workshop

Time: 10:30 - 12:30

Venue: Atlas Rooms, Kilburn Building, Oxford Road, Manchester, M13 9PL

Please note, this event has now ended. The datasets will be available at workshops on the 15th and 17th July 2019.

Digital Futures and the Cathie Marsh Institute (CMI) will host a workshop on June 26th, following up on last month's talk on the Data Science Research Partnership. The workshop will be led by George Wright (Head of Internet Research and Future Services, BBC Research & Development, and Visiting Simon Fellow at CMI) and will focus on accessing and working with BBC datasets, and on discussing possible collaborations with the BBC's R&D team.

The datasets are listed below with summary information; more details are listed here. If you are interested in using one of them, please email George directly by the 21st June. He will seek to arrange your access to those data for the workshop. Some are more flexible in use than others; in descending order of simplicity: Drama, MGB, PIPS, Genome.

Other potential datasets are listed here.

Please feel free to let George know if you have any questions. If you would like to attend the workshop but are not sure which datasets might be relevant, please register your interest by emailing Digital Futures.

BBC Datasets

Title: PIPS

Type: Programme Metadata

About: The main dataset of programme information starts in July 2007 and represents a continuous broadcast history from that point. This data includes: programme description, transmission details, some cast and crew, genre and format. In addition there is sporadic programme information prior to 2007 which is added when programmes from before this point are repeated.


Title: GENOME

Type: Programme Metadata

About: Scanned copies of 4,500 issues of the Radio Times from 1929 to 2009 (PIPs data is used as the record of transmission post-2009). The scanned data has been OCR'd and is available via a web interface here. Mo McRoberts is planning to make this data available via an API later in the year, but there are some issues around redacting data.

Title: ELVIS

Type: Images

About: Elvis is the BBC's publicity stills and photo library – 1.1m photos, of which 330,000 are green lit (BBC Copyright). Metadata for photos is inconsistent but when it's good it is fairly rich. For example, well documented photos of famous people will usually list all the other notable people in the photograph, their position or job (e.g. MP for Bexley Heath) and the location alongside rights information.


Title: Subtitle Data

Type: Programme Metadata

About: There are two sets of subtitle data. The first is the historic subtitle dataset supplied by Red Bee, which contains a variety of subtitle files from the early 1980s onwards. The second is the Redux/Snippets subtitle dataset, which contains subtitles for all BBC broadcasts from July 2007 onwards. The API to it can be found here.
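The exact file formats in these collections vary and are not specified here. As an illustration only, assuming some files follow the common SubRip (.srt) layout, a minimal parser for pulling out timed cues might look like:

```python
import re

# Hypothetical sample cue in SubRip (.srt) layout -- the actual BBC subtitle
# files come in a variety of formats, so treat this purely as a sketch.
SAMPLE = """1
00:00:01,000 --> 00:00:03,500
Good evening, and welcome to the news.

2
00:00:04,000 --> 00:00:06,000
Our top story tonight...
"""

# One cue: index, start timestamp, end timestamp, then text up to a blank line.
CUE = re.compile(
    r"(\d+)\s+(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})\s+(.+?)(?:\n\n|\Z)",
    re.S,
)

def parse_srt(text):
    """Return a list of (index, start, end, text) tuples."""
    return [(int(i), start, end, body.strip())
            for i, start, end, body in CUE.findall(text)]

cues = parse_srt(SAMPLE)
print(cues)
```

A parsed cue list like this is a convenient starting point for aligning subtitle text with broadcast times.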


Title: PROTEUS

Type: Programme Metadata

About: Proteus is a programme metadata repository that was originally designed as a commissioning and reporting tool for Radio 4. There are entries for 1.3 million transmissions that use the PIPS episode model for identification. It has very good metadata for Radio 2, 3, 4, 6 Music, and thinner data for Radio 1, 1Extra, 5 Live, 5 Live Sports Extra, 4 Extra and Asian Network.

Title: JUPITER   

Type: Video

About: Jupiter is the name of the video server and content-editing system for BBC News, which contains tens of thousands of daily-changing videos from news feeds and correspondents around the world. The hardware is provided by Quantel (self-supported by News), and the software (Colledia) and its interface development, support and maintenance were transferred by Atos to the BBC (Atos still owns the core software).

Title: INFAX

Type: Programme Metadata

About: Infax is I&A's longest-running programme information store. It contains details of programmes running back to 1922, but the data from the early days is very patchy and often gives incorrect dates.

Title: PasCs

Type: Production metadata

About: Steve Daly has a database of thousands of scripts and Programme-as-Completed forms from various programmes from 1980 to 2000. They have been scanned in as TIFFs, but no OCR or any other form of extraction has been performed on them yet.

Title: P4A

Type: Production metadata

About: Acquired footage within programmes, music use within programmes, other production data.


Title: Post Production Scripts

Type: Programme metadata

About: A large number of post-production scripts in Word or PDF format, representing a large chunk of the BBC drama output from 2007 to 2014. These contain full dialogue, character names, scene descriptions, and some timing and music data.


Title: BBC Newsreel Collection

Type: Video

About: A large number of .flv viewing copy files of BBC Newsreel programmes from 1948-1959, given to us by the Rewind Project.

Title: Radio Permanent Archive Collection

Type: Audio

About: Thousands of radio programmes permanently archived on the open web to anyone via the Radio 4 permanent archive collection. The two biggest programme collections are Desert Island Discs and In Our Time, but there are hundreds of factual and news strands featured.

Title: World Service Archive Data

Type: Metadata

About: Machine-generated and user-generated tags for the c.50,000 programmes processed via the World Service Archive project. Programme descriptions (original and user-edited), genres, tx dates.

Title: Home Front Assets

Type: Audio plus metadata

About: All episodes, scripts, scene description, storyline description, character description and associated story structure metadata

Title: Twitter Data Dump

Type: Twitter metadata

About: A full archive of tweets from the Twitter firehose covering a time period of 3 months during 2010. The firehose data includes *all* tweets posted on Twitter, not just the filtered subset you normally get through their APIs.
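The on-disk layout of the dump is not specified here. As a sketch, assuming one JSON object per line with Twitter's standard "created_at" timestamp format, a first pass over the archive to count tweets per calendar day might look like:

```python
import json
from collections import Counter

# Hypothetical sample lines -- invented for illustration; the real dump's
# layout would need to be confirmed before writing any processing code.
SAMPLE_LINES = [
    '{"id": 1, "created_at": "Mon Jul 05 09:14:02 +0000 2010", "text": "hello"}',
    '{"id": 2, "created_at": "Mon Jul 05 11:30:45 +0000 2010", "text": "world"}',
    '{"id": 3, "created_at": "Tue Jul 06 08:00:00 +0000 2010", "text": "again"}',
]

def tweets_per_day(lines):
    """Count tweets per calendar day from newline-delimited JSON tweets."""
    counts = Counter()
    for line in lines:
        tweet = json.loads(line)
        # "Mon Jul 05 09:14:02 +0000 2010" -> keep weekday, month, day, year.
        ts = tweet["created_at"].split()
        counts[" ".join([ts[0], ts[1], ts[2], ts[5]])] += 1
    return counts

print(tweets_per_day(SAMPLE_LINES))
```

Because the firehose contains every tweet rather than a sampled subset, even simple aggregate counts like this are unbiased, which is the main appeal of the dataset.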

Also potentially (subject to clarification of the legal issues):

News Comments dataset: 20k News and Sport articles with user comments from Facebook

One day’s worth of Radio: A whole day’s worth of BBC Network Radio transcribed and speaker ID’d. Each speaker utterance is transcribed and a name associated with it. Music in and out points are also flagged up. Could be used for Speaker ID, gender detection, speech/music detection etc.
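The transcript format is not specified here, but since each utterance carries a speaker name and timing, per-speaker analyses fall out directly. A minimal sketch, assuming utterances arrive as (speaker, start_seconds, end_seconds, text) records (an invented structure for illustration):

```python
from collections import defaultdict

# Hypothetical utterance records -- the real transcript schema would need to
# be confirmed; this only illustrates the kind of analysis the data supports.
UTTERANCES = [
    ("Presenter", 0.0, 12.5, "Good morning, it's eight o'clock..."),
    ("Guest A", 12.5, 40.0, "Well, thank you for having me..."),
    ("Presenter", 40.0, 45.0, "Let me stop you there."),
]

def talk_time(utterances):
    """Total seconds of speech attributed to each named speaker."""
    totals = defaultdict(float)
    for speaker, start, end, _text in utterances:
        totals[speaker] += end - start
    return dict(totals)

print(talk_time(UTTERANCES))
```

The same loop structure extends naturally to the suggested uses: group by speaker for speaker ID, by inferred gender for gender detection, or by the flagged music in/out points for speech/music segmentation.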

Top 100 Landmarks: A list of 100 notable landmarks around the world (e.g. Eiffel Tower, Statue of Liberty), with half a dozen clips of each one found by a researcher in various TV shows. Could be used for object/landmark detection.

Letter from America Archive: Transcripts, summaries and audio for 1,500-odd Letter from America radio shows. Could be used for text-to-speech/synthetic voice training, cultural/historic analysis, and automatic summarisation.

BBC CODAM Face Recognition Ground Truth: 38 politicians/presenters with six clips of each person, along with the associated programme. Can be used for testing face recognition systems.

BBC CODAM Credits Ground Truth: The transcribed credits to 50 programmes taken from the 1960s, 70s, 80s, 90s and 00s (10 from each decade), to be used to test automatic text detection/OCR systems, which could allow us to automatically label who's in a particular archive programme.

Text Spotting Ground Truth: 900 stills from various BBC News and factual programmes with their captions accurately transcribed. Could be used for automatic text detection/OCR research.
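Accurately transcribed captions like these (and the credits ground truth above) are typically used to score an OCR system by character error rate. A minimal sketch of that metric, using a plain Levenshtein edit distance and invented example strings:

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert/delete/substitute, cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def char_error_rate(reference, hypothesis):
    """Levenshtein distance normalised by the reference caption length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

# Toy example: a ground-truth caption vs. an imperfect OCR reading.
print(char_error_rate("BREAKING NEWS", "8REAKING NEVVS"))
```

Averaging this rate over all 900 stills (or the 50 credit sequences) gives a single comparable score for each OCR system under test.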